Tibbles
Task 1:Loading the tidyverse package.
library(tidyverse)
Task 2:Converting the iris dataset to a tibble.
as_tibble(iris)
Task 3: Creating a tibble with columns “x,” “y,” and “z,” where “x”
ranges from 1 to 5, “y” is 1 for all rows, and “z” is calculated as the
square of “x” plus “y” for each row.
tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y
)
Task 4:Creating a tibble with columns named “:)” (representing
“smile”), ” ” (representing “space”), and “2000” (representing
“number”).
tb <- tibble(
`:)` = "smile",
` ` = "space",
`2000` = "number"
)
tb
Task 5:Creating a tibble with columns “x,” “y,” and “z,” containing
the values “a,” 2, 3.6 and “b,” 1, 8.5 respectively.
tribble(
~x, ~y, ~z,
"a", 2, 3.6,
"b", 1, 8.5
)
Tibbles vs. data.frame
Task-1:Creating a tibble with columns “a,” “b,” “c,” “d,” and “e,”
containing 1000 randomly generated values for each column, representing
dates, numbers, and letters.
tibble(
a = lubridate::now() + runif(1e3) * 86400,
b = lubridate::today() + runif(1e3) * 30,
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = TRUE)
)
Task 2: Tnstalling the package
package_to_install <- c("nycflights13")
for (package_name in package_to_install) {
if (!requireNamespace(package_name, quietly = TRUE)) {
install.packages(package_name)
}
}
library(nycflights13)
Task 3: Printing the first 10 rows of the nycflights13::flights
dataset with unlimited width.
nycflights13::flights %>%
print(n = 10, width = Inf)
Task 4: Viewing the nycflights13::flights dataset in a separate
window for interactive exploration.
nycflights13::flights %>%
View()
Subsetting
Task 1: Creating a tibble named “df” with columns “x” and “y,” then
accessing the “x” column using different methods:
df <- tibble(
x = runif(5),#function that generates random numbers from a uniform distribution
y = rnorm(5) # function that generates random numbers from a normal (Gaussian) distribution
)
df$x
[1] 0.4134781 0.3841133 0.5761670 0.6047906 0.8490257
df[["x"]]
[1] 0.4134781 0.3841133 0.5761670 0.6047906 0.8490257
df[[1]]
[1] 0.4134781 0.3841133 0.5761670 0.6047906 0.8490257
df %>% .$x
[1] 0.4134781 0.3841133 0.5761670 0.6047906 0.8490257
Interacting with older code
Task-1: Determining the class of the object “tb” after converting it
to a data frame.
class(as.data.frame(tb))
[1] "data.frame"
Exercises
Task-1: How can you tell if an object is a tibble? (Hint: try
printing mtcars, which is a regular data frame).
mtcars
Task-2
# In a data.frame, extracting a non-existent column returns NULL,
# whereas in a tibble, it raises an error, providing immediate feedback.
# Other operations, such as extracting existing columns and subsets of columns,
# behave similarly across both data frames and tibbles.
# The default behavior of data.frames may lead to frustration
# due to the lack of error feedback for non-existent columns,
# potentially causing unnoticed mistakes and difficulty in debugging.
# In contrast, tibbles offer more robust behavior, enhancing data integrity
# and debugging efficiency.
df <- data.frame(abc = 1, xyz = "a")
# Extracting non-existent column in a data.frame
df$x # Returns NULL
[1] "a"
# Extracting existing column in a data.frame
df[, "xyz"] # Returns a data frame with one column containing the values of the "xyz" column
[1] "a"
# Extracting multiple columns in a data.frame
df[, c("abc", "xyz")] # Returns a data frame containing only the specified columns
NA
Task-3:If you have the name of a variable stored in an object,
e.g. var <- “mpg”, how can you extract the reference variable from a
tibble?
No pacakages
# heights <- read_csv("data/heights.csv")
Task 1: listing several tables: table1, table2, table3, table4a, and
table4b.
table1
table2
table3
table4a
table4b
Task 2: Calculating the rate by dividing the number of cases by the
population and then multiplying by 10,000 for table1.
table1 %>%
mutate(rate = cases / population * 10000)
Task 3: Counting the occurrences of each year in table1, using the
‘cases’ column as the weight.
table1 %>%
count(year, wt = cases)
Task 4: Creating a ggplot using table1, plotting ‘year’ against
‘cases’ with lines grouped by ‘country’ and colored in grey50, along
with points colored by ‘country’.
library(ggplot2)
ggplot(table1, aes(year, cases)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country))

Pivoting
Longer
Task-1: referring to ‘table4a’
table4a
Task-2: Reshaping table4a using pivot_longer for columns ‘1999’ and
‘2000’ into ‘year’ and ‘cases’.
table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
Task-3: Reshaping table4b with pivot_longer for columns ‘1999’ and
‘2000’ into ‘year’ and ‘population’.
table4b %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population") #function transforms wide data into long format by stacking multiple columns into two: one for variable names and one for their corresponding values
Task-4: creating tidy datasets tidy4a and tidy4b by using
pivot_longer on table4a and table4b to reshape them. Then, performing a
left join on tidy4a and tidy4b.
tidy4a <- table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
left_join(tidy4a, tidy4b)
Joining with `by = join_by(country, year)`
Wider
Task-1:Displaying table 2
table2
Task-2: using the pivot_wider function on table2 to transform it from
long to wide format, with ‘type’ becoming the new column names and
‘count’ being the corresponding values.
table2 %>%
pivot_wider(names_from = type, values_from = count)
Separating and uniting
Separate
Task-1:displaying table3
table3
Task-2: Using the separate function on table3 splits the ‘rate’
column into two separate columns named ‘cases’ and ‘population’.
table3 %>%
separate(rate, into = c("cases", "population"))
Task-3:Using the separate function on table3 splits the ‘rate’ column
into two separate columns named ‘cases’ and ‘population’, using the ‘/’
character as the separator.
table3 %>%
separate(rate, into = c("cases", "population"), sep = "/")
Task-4:Using the separate function on table3 splits the ‘rate’ column
into two separate columns named ‘cases’ and ‘population’, converting the
resulting columns to their appropriate data types.
table3 %>%
separate(rate, into = c("cases", "population"), convert = TRUE)
Task-5: Applying the separate function to table3, the ‘year’ column
is divided into two separate columns labeled ‘century’ and ‘year’, with
the separator defined as the second character.
table3 %>%
separate(year, into = c("century", "year"), sep = 2)
Unite
Task-1: The unite function is applied to table5 to merge the
‘century’ and ‘year’ columns into a single column named ‘new’.
table5 %>%
unite(new, century, year)
Task-2: unite function is applied to table5 to merge the ‘century’
and ‘year’ columns into a single column named ‘new’, with no separator
between them.
table5 %>%
unite(new, century, year, sep = "")
Missing values
Task-1: Create a tibble named “stocks” with columns “year”, “qtr”
(quarter), and “return”, having data for 2015 and 2016, with quarterly
returns specified and some missing entries as NA.
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
Task-2:Pivoting the “stocks” tibble to widen the data, extracting
columns from the “year” variable and values from the “return”
variable.
stocks %>%
pivot_wider(names_from = year, values_from = return)
Task-3: pivot the data to a wide format with columns for each year’s
returns, then reshape it back to a long format, keeping only the
non-missing values in the “return” column.
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(
cols = c(`2015`, `2016`),
names_to = "year",
values_to = "return",
values_drop_na = TRUE
)
Task-4:Filling missing combinations of “year” and “qtr” in the
“stocks” dataset.
stocks %>%
complete(year, qtr)
Task-5:Creating a tibble named “treatment” containing information
about individuals, their treatment groups, and their responses, with
some missing values for the “person” column.
treatment <- tribble(
~ person, ~ treatment, ~response,
"Derrick Whitmore", 1, 7,
NA, 2, 10,
NA, 3, 9,
"Katherine Burke", 1, 4
)
Task-6: Filling the missing values in the “person” column of the
“treatment” tibble.
treatment %>%
fill(person)
NA
Case Study
Task-1: Loading data set
who
Task-2:Pivoting the “who” dataset from wide to long format,
condensing columns into “cases” and capturing the original column names
in “key”.
who1 <- who %>%
pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = "key",
values_to = "cases",
values_drop_na = TRUE
)
who1
Task-3:Counting the occurrences of each “key” in the “who1”
dataset.
who1 %>%
count(key)
Task-4:Replacing “newrel” with “new_rel” in the “key” column of the
“who1” dataset to create “who2.”
who2 <- who1 %>%
mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
who2
Task-5:Separating the “key” column in the “who2” dataset into “new,”
“type,” and “sexage” columns using “_” as the separator to create
“who3.”
who3 <- who2 %>%
separate(key, c("new", "type", "sexage"), sep = "_")
who3
Task-6:Counting the occurrences of each unique value in the “new”
column of the “who3” dataset.
who3 %>%
count(new)
Task-7:Removing the “new”, “iso2”, and “iso3” columns from the “who3”
dataset and assigning the result to “who4”.
who4 <- who3 %>%
select(-new, -iso2, -iso3)
Task-8:Splitting the “sexage” column of the “who4” dataset into “sex”
and “age” columns, separated by the first character, and assigning the
result to “who5”.
who5 <- who4 %>%
separate(sexage, c("sex", "age"), sep = 1)
who5
Task-9:Transforming the “who” dataset from wide to long format,
adjusting column names, extracting meaningful variables, dropping
unnecessary columns, and splitting the “sexage” column into “sex” and
“age”.
who %>%
pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = "key",
values_to = "cases",
values_drop_na = TRUE
) %>%
mutate(
key = stringr::str_replace(key, "newrel", "new_rel")
) %>%
separate(key, c("new", "var", "sexage")) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1)
NA
CH-13: Relational data
Task-1:Loding the libraries
library(tidyverse)
library(nycflights13)
nycflights13
Task-1: airlines data
airlines
Task-2: airports data
airports
Task-3: planes data
planes
Task-4: weather data
weather
Keys
Task-1Counting the occurrences of each tail number in the “planes”
table and filtering for those with more than one occurrence.
planes %>%
count(tailnum) %>%
filter(n > 1)
Task-2:Counting the occurrences of each combination of year, month,
day, hour, and origin in the “weather” table and filtering for those
with more than one occurrence.
weather %>%
count(year, month, day, hour, origin) %>%
filter(n > 1)
Task-3:Counting the occurrences of each combination of year, month,
day, and flight in the “flights” table and filtering for those with more
than one occurrence.
flights %>%
count(year, month, day, flight) %>%
filter(n > 1)
Task-4:Counting the occurrences of each combination of year, month,
day, and tail number in the “flights” table and filtering for those with
more than one occurrence.
flights %>%
count(year, month, day, tailnum) %>%
filter(n > 1)
Mutating joins
Task-1: Creating a subset of the “flights” table named “flights2”
containing columns from “year” to “day”, “hour”, “origin”, “dest”,
“tailnum”, and “carrier”.
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
flights2
Task-2:Removing the “origin” and “dest” columns from “flights2” table
and then performing a left join with the “airlines” table, using the
“carrier” column as the key for matching.
flights2 %>%
select(-origin, -dest) %>%
left_join(airlines, by = "carrier")
Task-3:Shortening the command by removing “selecting” and directly
“mutating” the “name” column with the corresponding airline names from
the “airlines” table based on the “carrier” column.
flights2 %>%
select(-origin, -dest) %>%
mutate(name = airlines$name[match(carrier, airlines$carrier)])
Understanding joins
Task-1:Creating two tibbles, “x” and “y”, each with a “key” column
and an associated “val_x” or “val_y” column, respectively.
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
3, "x3"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
4, "y3"
)
x
y
Inner join
Task-1:Joining tibbles x and y using an
inner join operation based on the “key” column.
x %>%
inner_join(y, by = "key")
Duplicate keys
Task-1: Joining tibble x with tibble y using the common column
“key”.
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
2, "x3",
1, "x4"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2"
)
Task-2:Performing a left join between tibble x and
tibble y based on the common column “key”.
left_join(x, y, by = "key")
Task-3:Creating two tibbles, x and y, with
columns “key”, “val_x”, and “val_y”, populated with corresponding
values.
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
2, "x3",
3, "x4"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
2, "y3",
3, "y4"
)
Task-4:Performing a left join on tibbles x and
y using the “key” column as the join key.
left_join(x, y, by = "key")
Warning: Detected an unexpected many-to-many relationship between `x` and `y`.
Defining the key columns
Task-1:Performing a left join between the flights2
tibble and the weather tibble.
flights2 %>%
left_join(weather)
Joining with `by = join_by(year, month, day, hour, origin)`
Task-2:Performing a left join between the flights2
tibble and the planes tibble using the “tailnum” column as
the key.
flights2 %>%
left_join(planes, by = "tailnum")
Task-3:Performing a left join between the flights2
tibble and the airports tibble, matching the “dest” column
from flights2 with the “faa” column from
airports.
flights2 %>%
left_join(airports, c("dest" = "faa"))
Task-4:Performing a left join between the flights2
tibble and the airports tibble, matching the “origin”
column from flights2 with the “faa” column from
airports.
flights2 %>%
left_join(airports, c("origin" = "faa"))
Filtering joins
Task-1: Calculating the top 10 destinations by counting the
occurrences in the “dest” column of the flights tibble,
sorted in descending order, and then displaying the result.
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
head(10)
top_dest
Task-2: Filtering the flights tibble to include only
rows where the destination (dest) matches any of the top 10
destinations identified in the previous step.
flights %>%
filter(dest %in% top_dest$dest)
#%in% operator in R is used to check if elements in one vector are present in another vector
Task-3: Selecting rows from the flights dataset where
the destination airport matches one of the top 10 destinations
previously identified.
flights %>%
semi_join(top_dest)
Joining with `by = join_by(dest)`
Task-4: Filtering out flights with tail numbers present in the planes
dataset and counting the occurrences of each unique tail number, sorting
the result.
flights %>%
anti_join(planes, by = "tailnum") %>%
count(tailnum, sort = TRUE)
Set operations
Task-1:creating two tibbles, df1 and df2, each with columns x and y,
containing sample data.
df1 <- tribble(
~x, ~y,
1, 1,
2, 1
)
df2 <- tribble(
~x, ~y,
1, 1,
1, 2
)
Task-2:performing set operations on the tibbles df1 and df2,
including intersection, union, and set differences.
intersect(df1, df2)
union(df1, df2)
setdiff(df1, df2)
setdiff(df2, df1)
CH-14: Strings
Basic Info:string1 <- “This
is a string” string2 <- ‘If I want to include a “quote” inside a
string, I use single quotes’
Task-1:To include a literal single or double quote in a string you
can use to “escape” it
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
Task-2: Understanding the character
x <- c("\"", "\\") #backslash is escape character
x
[1] "\"" "\\"
writeLines(x)
"
\
String length
Task-1:
str_length(c("a", "R for data science", NA))
[1] 1 18 NA
Combining strings
Task-1:Combining the strings
str_c("x", "y")
[1] "xy"
str_c("x", "y", "z")
[1] "xyz"
Task-2:Using the sep argument to control how they’re separated.
str_c("x", "y", sep = ", ")
[1] "x, y"
Task-3:Performing concatenation with “|” and “-” at both ends of each
element of vector x, and replacing NA values with empty strings before
concatenation.
x <- c("abc", NA)
str_c("|-", x, "-|")
[1] "|-abc-|" NA
str_c("|-", str_replace_na(x), "-|")
[1] "|-abc-|" "|-NA-|"
Task-4: concatenating each element of the vector c(“a”, “b”, “c”)
with a prefix “prefix-” and a suffix “-suffix”.
str_c("prefix-", c("a", "b", "c"), "-suffix")
[1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"
Task-5: combining strings
name <- "Hadley"
time_of_day <- "morning"
birthday <- FALSE
str_c(
"Good ", time_of_day, " ", name,
if (birthday) " and HAPPY BIRTHDAY",
"."
)
[1] "Good morning Hadley."
Subsetting strings
Task-1:Extracting the first three characters from each element in the
vector x using str_sub.
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
[1] "App" "Ban" "Pea"
Task-2:negative numbers count backwards from end
str_sub(x, -3, -1)
[1] "ple" "ana" "ear"
Task-3:using the assignment form of str_sub() to modify strings
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
[1] "apple" "banana" "pear"
Locales
Task-1:Changing the case
str_to_upper(c("i", "ı"))
[1] "I" "I"
str_to_upper(c("i", "ı"), locale = "tr")
[1] "İ" "I"
Task-2:Sorting the character vector x alphabetically using the
English (en) locale and the Hawaiian (haw) locale.
x <- c("apple", "eggplant", "banana")
str_sort(x, locale = "en")
[1] "apple" "banana" "eggplant"
str_sort(x, locale = "haw")
[1] "apple" "eggplant" "banana"
Matching patterns with regular expressions
Basic matches
Task-1:Searching for the pattern “an” within each element of
x and displaying the matches.
x <- c("apple", "banana", "pear")
str_view(x, "an")
[2] │ b<an><an>a
Task-2:Displaying elements in x where any character is
followed by “a” and then any character.
str_view(x, ".a.")
[2] │ <ban>ana
[3] │ p<ear>
Task-3
# To create the regular expression, we need \\
dot <- "\\."
# But the expression itself only contains one:
writeLines(dot)
\.
# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")
[2] │ <a.c>
Task-4: Displaying elements in x where the sequence “\”
occurs.
x <- "a\\b"
writeLines(x)
a\b
str_view(x, "\\\\")
[1] │ a<\>b
Anchors
Task-1: Displaying elements in x that start with “a” and
end with “a” respectively.
x <- c("apple", "banana", "pear")
str_view(x, "^a")
[1] │ <a>pple
str_view(x, "a$")
[2] │ banan<a>
Task-2: Highlighting “apple” occurrences in x and
instances where it’s the only content.
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
[1] │ <apple> pie
[2] │ <apple>
[3] │ <apple> cake
str_view(x, "^apple$")
[2] │ <apple>
Character classes and alternatives
Task-1: Visualizing patterns matching “a.c”, “a*c”, and “a c” in the
provided character vector.
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
[2] │ <a.c>
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
[3] │ <a*c>
str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")
[4] │ <a >c
Task-2: Visualizing patterns matching “grey” or “gray” in the
provided character vector.
str_view(c("grey", "gray"), "gr(e|a)y")
[1] │ <grey>
[2] │ <gray>
Repetition
Task-1:Identifying patterns “CC” or “C” in the string “1888 is the
longest year in Roman numerals
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
[1] │ 1888 is the longest year in Roman numerals: MD<CC><C>LXXXVIII
Task-2: Viewing the pattern “CC”
str_view(x, "CC+")
[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII
Task-3: Viewing the pattern “C[LX]+”
str_view(x, 'C[LX]+')
[1] │ 1888 is the longest year in Roman numerals: MDCC<CLXXX>VIII
Task-4:Viewing the pattern “C{2},C{2,},c{2,3}”
str_view(x, "C{2}")
[1] │ 1888 is the longest year in Roman numerals: MD<CC>CLXXXVIII
str_view(x, "C{2,}")
[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII
str_view(x, "C{2,3}")
[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII
Grouping and backreferences
Task-1:Grouping
str_view(fruit, "(..)\\1", match = TRUE)
[4] │ b<anan>a
[20] │ <coco>nut
[22] │ <cucu>mber
[41] │ <juju>be
[56] │ <papa>ya
[73] │ s<alal> berry
Detect matches
Task-1: Checking for the presence of the letter “e” in each word
x <- c("apple", "banana", "pear")
str_detect(x, "e")
[1] TRUE FALSE TRUE
Task-2:Checking how many common words start with t
sum(str_detect(words, "^t"))
[1] 65
Task-3: Checking proportion of common words end with a vowel
mean(str_detect(words, "[aeiou]$"))
[1] 0.2765306
Task-4:Finding all words containing at least one vowel, and
negate
no_vowels_1 <- !str_detect(words, "[aeiou]")
Task-5:Finding all words consisting only of consonants
(non-vowels)
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
identical(no_vowels_1, no_vowels_2)
[1] TRUE
Task-6: Filtering words that end with the letter “x” from a list of
words.
words[str_detect(words, "x$")]
[1] "box" "sex" "six" "tax"
str_subset(words, "x$")
[1] "box" "sex" "six" "tax"
Task-7: Filtering a tibble for words that end with “x”.
df <- tibble(
word = words,
i = seq_along(word)
)
df %>%
filter(str_detect(word, "x$"))
Task-8:Counting the occurrences of “a” in each element of a character
vector.
x <- c("apple", "banana", "pear")
str_count(x, "a")
[1] 1 3 1
Task-9: Seeing average of how many vowels per word
mean(str_count(words, "[aeiou]"))
[1] 1.991837
Task-10: Adding columns to a tibble to count vowels and consonants in
each word.
df %>%
mutate(
vowels = str_count(word, "[aeiou]"),
consonants = str_count(word, "[^aeiou]")
)
Task-11:Counting “aba” occurrences in “abababa” and showing all “aba”
instances.
str_count("abababa", "aba")
[1] 2
str_view_all("abababa", "aba")
Warning: `str_view_all()` was deprecated in stringr 1.5.0.
Please use `str_view()` instead.
[1] │ <aba>b<aba>
Grouped matches
Task-1: Extracting sentences containing nouns defined by a pattern,
then extracts the nouns from those sentences.
noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>%
str_subset(noun) %>%
head(10)
has_noun %>%
str_extract(noun)
[1] "the smooth" "the sheet" "the depth" "a chicken" "the parked" "the sun" "the huge" "the ball"
[9] "the woman" "a helps"
Task-2:
has_noun %>%
str_match(noun)
[,1] [,2] [,3]
[1,] "the smooth" "the" "smooth"
[2,] "the sheet" "the" "sheet"
[3,] "the depth" "the" "depth"
[4,] "a chicken" "a" "chicken"
[5,] "the parked" "the" "parked"
[6,] "the sun" "the" "sun"
[7,] "the huge" "the" "huge"
[8,] "the ball" "the" "ball"
[9,] "the woman" "the" "woman"
[10,] "a helps" "a" "helps"
Task-3:Creating a tibble with columns ‘article’ and ‘noun’ extracted
from sentences based on a pattern.
tibble(sentence = sentences) %>%
tidyr::extract(
sentence, c("article", "noun"), "(a|the) ([^ ]+)",
remove = FALSE
)
Replacing matches
Task-1: Replacing the first vowel in each word of x with a hyphen.
Replacing all vowels in each word of x with a hyphen.
x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
[1] "-pple" "p-ar" "b-nana"
str_replace_all(x, "[aeiou]", "-")
[1] "-ppl-" "p--r" "b-n-n-"
Task-2: Replacing numeric values in x with their corresponding word
representations.
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
[1] "one house" "two cars" "three people"
Task-3:Reordering words in sentences by swapping the second and third
word positions.
sentences %>%
str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
head(5)
[1] "The canoe birch slid on the smooth planks." "Glue sheet the to the dark blue background."
[3] "It's to easy tell the depth of a well." "These a days chicken leg is a rare dish."
[5] "Rice often is served in round bowls."
Splitting
Task-1: Splitting the first five sentences into words.
sentences %>%
head(5) %>%
str_split(" ")
[[1]]
[1] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
[[2]]
[1] "Glue" "the" "sheet" "to" "the" "dark" "blue"
[8] "background."
[[3]]
[1] "It's" "easy" "to" "tell" "the" "depth" "of" "a" "well."
[[4]]
[1] "These" "days" "a" "chicken" "leg" "is" "a" "rare" "dish."
[[5]]
[1] "Rice" "is" "often" "served" "in" "round" "bowls."
Task-2:Splitting the string ‘a|b|c|d’ by ‘|’ into a vector of
elements.
"a|b|c|d" %>%
str_split("\\|") %>%
.[[1]]
[1] "a" "b" "c" "d"
Task-3:Splitting the first 5 sentences by space into a matrix of
words.
sentences %>%
head(5) %>%
str_split(" ", simplify = TRUE)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks." ""
[2,] "Glue" "the" "sheet" "to" "the" "dark" "blue" "background." ""
[3,] "It's" "easy" "to" "tell" "the" "depth" "of" "a" "well."
[4,] "These" "days" "a" "chicken" "leg" "is" "a" "rare" "dish."
[5,] "Rice" "is" "often" "served" "in" "round" "bowls." "" ""
Task-4:Splitting each field string into two parts at the first
occurrence of ‘:’.
fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
[,1] [,2]
[1,] "Name" "Hadley"
[2,] "Country" "NZ"
[3,] "Age" "35"
Task-5: Display word boundaries, split by spaces, and split by word
boundaries, respectively.
x <- "This is a sentence. This is another sentence."
str_view_all(x, boundary("word"))
[1] │ <This> <is> <a> <sentence>. <This> <is> <another> <sentence>.
str_split(x, " ")[[1]]
[1] "This" "is" "a" "sentence." "" "This" "is" "another"
[9] "sentence."
str_split(x, boundary("word"))[[1]]
[1] "This" "is" "a" "sentence" "This" "is" "another" "sentence"
Other types of pattern
Task-1:
# The regular call:
str_view(fruit, "nana")
[4] │ ba<nana>
# Is shorthand for
str_view(fruit, regex("nana"))
[4] │ ba<nana>
Task-2:Visualizing occurrences of “banana” in different case
variations.
bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")
[1] │ <banana>
str_view(bananas, regex("banana", ignore_case = TRUE))
[1] │ <banana>
[2] │ <Banana>
[3] │ <BANANA>
Task-3: Extracting all lines starting with “Line” from the text.
x <- "Line 1\nLine 2\nLine 3"
str_extract_all(x, "^Line")[[1]]
[1] "Line"
Task-4: Extracting all occurrences of lines starting with “Line” from
the text, considering each line separately.
str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
[1] "Line" "Line" "Line"
Task-5:Creating a regular expression pattern for phone numbers,
allowing for variations in formatting, and attempting to match it
against the provided phone number.
phone <- regex("
\\(? # optional opening parens
(\\d{3}) # area code
[) -]? # optional closing parens, space, or dash
(\\d{3}) # another three numbers
[ -]? # optional space or dash
(\\d{3}) # three more numbers
", comments = TRUE)
str_match("514-791-8141", phone)
[,1] [,2] [,3] [,4]
[1,] "514-791-814" "514" "791" "814"
Task-6:Installling the package and Benchmarking string detection in
“sentences” using fixed and regex patterns 20 times each, comparing
performance with microbenchmark.
package_to_install <- c("microbenchmark")
for (package_name in package_to_install) {
if (!requireNamespace(package_name, quietly = TRUE)) {
install.packages(package_name)
}
}
library(microbenchmark)
microbenchmark::microbenchmark(
fixed = str_detect(sentences, fixed("the")),
regex = str_detect(sentences, "the"),
times = 20
)
Unit: microseconds
Task-7:Starting with a1 being “0e1” and a2 being “a301”, both
representing the character “á”, they are compared for equality.
a1 <- "\u00e1"
a2 <- "a\u0301"
c(a1, a2)
[1] "á" "á"
a1 == a2
[1] FALSE
Task-8: Checking if a1 contains the fixed string
a2 returns FALSE, whereas using collation
rules returns TRUE.
str_detect(a1, fixed(a2))
[1] FALSE
str_detect(a1, coll(a2))
[1] TRUE
Task-9:Creating a vector i with different forms of the
letter “i”, then using str_subset to filter them based on
collation.
i <- c("I", "İ", "i", "ı")
i
[1] "I" "İ" "i" "ı"
str_subset(i, coll("i", ignore_case = TRUE))
[1] "I" "i"
str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))
[1] "İ" "i"
Task-10: Fetching locale information.
stringi::stri_locale_info()
$Language
[1] "en"
$Country
[1] "US"
$Variant
[1] ""
$Name
[1] "en_US"
Task-11:Visualizing word boundaries and extracts all words from the
string.
x <- "This is a sentence."
str_view_all(x, boundary("word"))
[1] │ <This> <is> <a> <sentence>.
str_extract_all(x, boundary("word"))
[[1]]
[1] "This" "is" "a" "sentence"
CH-15: Factors
Creatig factors
Task-1:Adding character vector in variable x1
x1 <- c("Dec", "Apr", "Jan", "Mar")
Task-2:Adding character vector in variable x2
x2 <- c("Dec", "Apr", "Jam", "Mar")
Task-3:Sorting X1
sort(x1)
[1] "Apr" "Dec" "Jan" "Mar"
Task-4:Adding Character vector in month_levels
month_levels <- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
Task-5:Assigning the factor levels to the variable x1, using the
predefined month_levels.
y1 <- factor(x1, levels = month_levels)
y
Task-6:Sorting the factor levels in y1.
sort(y1)
[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Task-7:creating a factor y2 from x2 with custom levels specified by
month_levels.
y2 <- factor(x2, levels = month_levels)
y2
[1] Dec Apr <NA> Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Task-8:parsing the values in x2 as factors
y2 <- parse_factor(x2, levels = month_levels)
Warning: 1 parsing failure.
row col expected actual
3 -- value in level set Jam
Task-9: omitting the levels.
factor(x1)
[1] Dec Apr Jan Mar
Levels: Apr Dec Jan Mar
Task-10:Creating a factor f1 from the values in x1, using the unique
values of x1 as levels.
f1 <- factor(x1, levels = unique(x1))
f1
[1] Dec Apr Jan Mar
Levels: Dec Apr Jan Mar
Task-11: creating a factor f2 from the values in x1, ordering them
according to their appearance in x1.
f2 <- x1 %>% factor() %>% fct_inorder()
f2
[1] Dec Apr Jan Mar
Levels: Dec Apr Jan Mar
Task-12:Omitting levels2
levels(f2)
[1] "Dec" "Apr" "Jan" "Mar"
General Social Survey
Task-1:Loading datasets
gss_cat
Task-2:Seeing levels through count()
gss_cat %>%
count(race)
Task-3:Also seeing through bar()
ggplot(gss_cat, aes(race)) +
geom_bar()

Task-4:Generating a bar plot using ggplot()
ggplot(gss_cat,aes(race))+geom_bar()+scale_x_discrete(drop=FALSE)

Modifying factor order
Task-1:calculating summary statistics and then creating scatter
plot
relig_summary <- gss_cat %>%
group_by(relig) %>%
summarise(
age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
)
ggplot(relig_summary, aes(tvhours, relig)) + geom_point()

Task-2:Generating a scatter plot using ggplot, where the
x-axis represents the mean TV hours (tvhours), and the
y-axis represents the relig variable reordered by mean TV
hours.
ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
geom_point()

Task-3:Creating a scatter plot using ggplot.
relig_summary %>%
mutate(relig = fct_reorder(relig, tvhours)) %>%
ggplot(aes(tvhours, relig)) +
geom_point()

Task-4:Generating a scatter plot using ggplot
rincome_summary <- gss_cat %>%
group_by(rincome) %>%
summarise(
age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
)
ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + geom_point()

Task-5: creates a scatter plot of the average age by income level,
with “Not applicable” as the reference level for income
ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
geom_point()

Task-6:calculating the proportion of each marital status group across
different age groups and creates a line plot showing the distribution of
marital status proportions by age.
by_age <- gss_cat %>%
filter(!is.na(age)) %>%
count(age, marital) %>%
group_by(age) %>%
mutate(prop = n / sum(n))
ggplot(by_age, aes(age, prop, colour = marital)) +
geom_line(na.rm = TRUE)

ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
geom_line() +
labs(colour = "marital")

Task-7: Adjusting the order of the “marital” variable based on
frequency and then reverses the order before generating a bar plot
illustrating the distribution of marital status.
gss_cat %>%
mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
ggplot(aes(marital)) +
geom_bar()

Modifying factor levels
Task-1: counting the frequency of each unique value in the “partyid”
variable of the “gss_cat” dataset.
gss_cat%>%count(partyid)
Task-2:Recording the levels of the “partyid” variable in the
“gss_cat” dataset and then counts the frequency of each unique recorded
value.
gss_cat %>%
mutate( partyid=fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat"
))%>%
count(partyid)
Task-3:Recategorizing and counting party affiliations in the
“gss_cat” dataset.
gss_cat %>%
mutate(partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat",
"Other" = "No answer",
"Other" = "Don't know",
"Other" = "Other party"
)) %>%
count(partyid)
Task-4: Collapsing categories within the “partyid” variable in the
“gss_cat” dataset into broader groups and then counting the frequency of
each collapsed category.
gss_cat%>%
mutate(partyid=fct_collapse(partyid,
other=c("No answer", "Don't know", "Other party"),
rep=c("Strong republican", "Not str republican"),
ind=c("Ind,near rep", "Independent", "Ind,near dem"),
dem=c("Not str democrat", "Strong democrat"))) %>%
count(partyid)
Task-5:Counting and aggregating religious affiliations in the
“gss_cat” dataset after lumping together less frequent categories.
gss_cat %>%
mutate(relig = fct_lump(relig)) %>%
count(relig)
Task-6:“Summarizing religious affiliations after lumping infrequent
categories and sort.”
gss_cat %>%
mutate(relig = fct_lump(relig, n = 10)) %>%
count(relig, sort = TRUE) %>%
print(n = Inf)
CH-Data and Times
Task-1:Loading library
library(tidyverse)
library(lubridate)
library(nycflights13)
Creating dates/times
Task-1: Printing current date or date-time
today()
[1] "2024-05-04"
now()
[1] "2024-05-04 20:51:16 +0545"
Form strings
Task-2:converting date strings to date objects in different
formats.
ymd("2017-01-31")
[1] "2017-01-31"
mdy("January 31st, 2017")
[1] "2017-01-31"
dmy("31-Jan-2017")
[1] "2017-01-31"
ymd(20170131)
[1] "2017-01-31"
ymd_hms("2017-01-31 20:11:59")
[1] "2017-01-31 20:11:59 UTC"
mdy_hm("01/31/2017 08:01")
[1] "2017-01-31 08:01:00 UTC"
flights %>%
select(year, month, day, hour, minute)
flights %>%
select(year, month, day, hour, minute) %>%
mutate(departure = make_datetime(year, month, day, hour, minute))
Task: Creating date-time objects from hour-minute time data in the
‘flights’ dataset and filtering out rows with missing departure or
arrival times
make_datetime_100 <- function(year, month, day, time) {
make_datetime(year, month, day, time %/% 100, time %% 100)
}
flights_dt <- flights %>%
filter(!is.na(dep_time), !is.na(arr_time)) %>%
mutate(
dep_time = make_datetime_100(year, month, day, dep_time),
arr_time = make_datetime_100(year, month, day, arr_time),
sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
) %>%
select(origin, dest, ends_with("delay"), ends_with("time"))
flights_dt
Task: Plotting the frequency of flights over time using departure
date-time
flights_dt %>%
ggplot(aes(dep_time)) +
geom_freqpoly(binwidth = 86400)

Task: Plotting the frequency of flights over time for a specific
period using departure date-time
flights_dt %>%
filter(dep_time < ymd(20130102)) %>%
ggplot(aes(dep_time)) +
geom_freqpoly(binwidth = 600) # 600 s = 10 minutes

Task: to convert today’s date to date-time object
as_datetime(today())
[1] "2024-05-04 UTC"
as_date(now())
[1] "2024-05-04"
as_date(365 * 10 + 2)
[1] "1980-01-01"
Date-time components Task: Extracting various components of a
date-time object
datetime <- ymd_hms("2016-07-08 12:34:56")
year(datetime)
[1] 2016
month(datetime)
[1] 7
mday(datetime)
[1] 8
yday(datetime)
[1] 190
wday(datetime)
[1] 6
month(datetime, label = TRUE)
[1] Jul
Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < Nov < Dec
wday(datetime, label = TRUE, abbr = FALSE)
[1] Friday
Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < Friday < Saturday
Task: Plotting the frequency of flights by day of the week
flights_dt %>%
mutate(wday = wday(dep_time, label = TRUE)) %>%
ggplot(aes(x = wday)) +
geom_bar()

Task: Plotting average delay by minute of departure time
flights_dt %>%
mutate(minute = minute(dep_time)) %>%
group_by(minute) %>%
summarise(
avg_delay = mean(arr_delay, na.rm = TRUE),
n = n()) %>%
ggplot(aes(minute, avg_delay)) +
geom_line()

Task: Plotting average delay by minute of scheduled departure
time
sched_dep <- flights_dt %>%
mutate(minute = minute(sched_dep_time)) %>%
group_by(minute) %>%
summarise(
avg_delay = mean(arr_delay, na.rm = TRUE),
n = n())
ggplot(sched_dep, aes(minute, avg_delay)) +
geom_line()

Task: Plotting the number of flights by minute of scheduled departure
time
ggplot(sched_dep, aes(minute, n)) +
geom_line()

Rounding Task:Plotting the number of flights by week, rounding to the
nearest week
flights_dt %>%
count(week = floor_date(dep_time, "week")) %>%
ggplot(aes(week, n)) +
geom_line()

setting compounds Task: Setting up a date-time object
(datetime <- ymd_hms("2016-07-08 12:34:56"))
[1] "2016-07-08 12:34:56 UTC"
year(datetime) <- 2020
datetime
[1] "2020-07-08 12:34:56 UTC"
month(datetime) <- 01
datetime
[1] "2020-01-08 12:34:56 UTC"
hour(datetime) <- hour(datetime) + 1
datetime
[1] "2020-01-08 13:34:56 UTC"
update(datetime, year = 2020, month = 2, mday = 2, hour = 2)
[1] "2020-02-02 02:34:56 UTC"
ymd("2015-02-01") %>%
update(mday = 30)
[1] "2015-03-02"
ymd("2015-02-01") %>%
update(hour = 400)
[1] "2015-02-17 16:00:00 UTC"
Task: Creating a new variable ‘dep_hour’ by updating the ‘dep_time’
to the first day of the year
flights_dt %>%
mutate(dep_hour = update(dep_time, yday = 1)) %>%
ggplot(aes(dep_hour)) +
geom_freqpoly(binwidth = 300)

Time Spans Compute the age of a person based on their birthdate and
today’s date
h_age <- today() - ymd(19791014)
h_age
Time difference of 16274 days
as.duration(h_age)
[1] "1406073600s (~44.56 years)"
dseconds(15)
[1] "15s"
dminutes(10)
[1] "600s (~10 minutes)"
dhours(c(12, 24))
[1] "43200s (~12 hours)" "86400s (~1 days)"
ddays(0:5)
[1] "0s" "86400s (~1 days)" "172800s (~2 days)" "259200s (~3 days)" "345600s (~4 days)"
[6] "432000s (~5 days)"
dweeks(3)
[1] "1814400s (~3 weeks)"
dyears(1)
[1] "31557600s (~1 years)"
2 * dyears(1)
[1] "63115200s (~2 years)"
dyears(1) + dweeks(12) + dhours(15)
[1] "38869200s (~1.23 years)"
tomorrow <- today() + ddays(1)
last_year <- today() - dyears(1)
one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")
one_pm
[1] "2016-03-12 13:00:00 EST"
one_pm + ddays(1)
[1] "2016-03-13 14:00:00 EDT"
Periods Create period objects representing different time spans and
Perform arithmetic operations with period objects
one_pm
[1] "2016-03-12 13:00:00 EST"
one_om = days(1)
seconds(15)
[1] "15S"
minutes(10)
[1] "10M 0S"
hours(c(12, 24))
[1] "12H 0M 0S" "24H 0M 0S"
days(7)
[1] "7d 0H 0M 0S"
months(1:6)
[1] "1m 0d 0H 0M 0S" "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S" "4m 0d 0H 0M 0S" "5m 0d 0H 0M 0S" "6m 0d 0H 0M 0S"
weeks(3)
[1] "21d 0H 0M 0S"
years(1)
[1] "1y 0m 0d 0H 0M 0S"
10 * (months(6) + days(1))
[1] "60m 10d 0H 0M 0S"
days(50) + hours(25) + minutes(2)
[1] "50d 25H 2M 0S"
ymd("2016-01-01") + dyears(1)
[1] "2016-12-31 06:00:00 UTC"
ymd("2016-01-01") + years(1)
[1] "2017-01-01"
one_pm + ddays(1)
[1] "2016-03-13 14:00:00 EDT"
one_pm + days(1)
[1] "2016-03-13 13:00:00 EDT"
Filter flights where arrival time is before departure time
flights_dt %>%
filter(arr_time < dep_time)
Update flights data to correct overnight flights
flights_dt <- flights_dt %>%
mutate(
overnight = arr_time < dep_time,
arr_time = arr_time + days(overnight * 1),
sched_arr_time = sched_arr_time + days(overnight * 1)
)
Filter flights where overnight condition is true and arrival time is
before departure time
flights_dt %>%
filter(overnight, arr_time < dep_time)
Intervals Calculate the ratio of one year in days
years(1) / days(1)
[1] 365.25
next_year <- today() + years(1)
(today() %--% next_year) / ddays(1)
[1] 365
(today() %--% next_year) %/% days(1)
[1] 365
Display time zone information
Sys.timezone()
[1] "Asia/Katmandu"
length(OlsonNames())
[1] 596
head(OlsonNames())
[1] "Africa/Abidjan" "Africa/Accra" "Africa/Addis_Ababa" "Africa/Algiers" "Africa/Asmara"
[6] "Africa/Asmera"
(x1 <- ymd_hms("2015-06-01 12:00:00", tz = "America/New_York"))
[1] "2015-06-01 12:00:00 EDT"
(x2 <- ymd_hms("2015-06-01 18:00:00", tz = "Europe/Copenhagen"))
[1] "2015-06-01 18:00:00 CEST"
(x3 <- ymd_hms("2015-06-02 04:00:00", tz = "Pacific/Auckland"))
[1] "2015-06-02 04:00:00 NZST"
x1 - x2
Time difference of 0 secs
x1 - x3
Time difference of 0 secs
Pipes
Task: To import the required library
packages_to_install <- c("tidyverse", "pryr")
for (package_name in packages_to_install) {
if (!requireNamespace(package_name, quietly = TRUE)) {
install.packages(package_name)
}
library(package_name, character.only = TRUE)
}
library(magrittr)
Create diamond data and calculate the object sizes
diamonds <- ggplot2::diamonds
diamonds2 <- diamonds %>%
dplyr::mutate(price_per_carat = price / carat)
pryr::object_size(diamonds)
3.46 MB
pryr::object_size(diamonds2)
3.89 MB
pryr::object_size(diamonds, diamonds2)
3.89 MB
Functions Normalize the columns of a data frame
df <- tibble::tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df$a <- (df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) /
(max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) /
(max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) /
(max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
Normalize a single column of a data frame
(df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
[1] 0.2660918 0.1288832 0.0769690 0.3163641 0.5612945 0.6241704 0.5271891 0.0000000 0.3913369 1.0000000
x <- df$a
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
[1] 0.2660918 0.1288832 0.0769690 0.3163641 0.5612945 0.6241704 0.5271891 0.0000000 0.3913369 1.0000000
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
[1] 0.2660918 0.1288832 0.0769690 0.3163641 0.5612945 0.6241704 0.5271891 0.0000000 0.3913369 1.0000000
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
rescale01(c(0, 5, 10))
[1] 0.0 0.5 1.0
Rescale a vector to the range [0, 1]
rescale01(c(-10, 0, 10))
[1] 0.0 0.5 1.0
rescale01(c(1, 2, 3, NA, 5))
[1] 0.00 0.25 0.50 NA 1.00
Rescale each column of a DataFrame to the range [0, 1]
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
x <- c(1:10, Inf)
rescale01(x)
[1] 0 0 0 0 0 0 0 0 0 0 NaN
Define the rescale01 function and apply it
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE, finite = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
rescale01(x)
[1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667 0.7777778 0.8888889 1.0000000
[11] Inf
Load required libraries and packages
library(tidyverse)
library(purrr)
library(magrittr)
# install.packages("pryr")
library(pryr)
18.2 Piping alternatives
This is a popular Children’s poem that is accompanied by hand
actions.We’ll start by defining an object to represent little bunny Foo
Foo:
# foo_foo <- little_bunny()
18.2.2 Overwrite the original
Instead of creating intermediate objects at each step, we could
overwrite the original object:
# foo_foo <- hop(foo_foo, through = forest)
# foo_foo <- scoop(foo_foo, up = field_mice)
# foo_foo <- bop(foo_foo, on = head)
18.2.3 Function composition
Another approach is to abandon assignment and just string the
function calls together:
# bop(
# scoop(
# hop(foo_foo, through = forest),
# up = field_mice
# ),
# on = head
# )
Here the disadvantage is that you have to read from inside-out, from
right-to-left, and that the arguments end up spread far apart
(evocatively called the dagwood sandwhich problem). In short, this code
is hard for a human to consume.
18.2.4 Use the pipe
Finally, we can use the pipe:
# foo_foo %>%
# hop(through = forest) %>%
# scoop(up = field_mice) %>%
# bop(on = head)
# my_pipe <- function(.) {
# . <- hop(., through = forest)
# . <- scoop(., up = field_mice)
# bop(., on = head)
# }
# my_pipe(foo_foo)
TASK: Functions that use the current environment. For example,
assign() will create a new variable with the given name in
the current environment:
assign("x",10)
x
[1] 10
"x" %>% assign(100)
x
[1] 10
Assign value to “x” in the specified environment and check its value
and Generate random numbers, create a matrix, plot it, and inspect its
structure
env <- environment()
"x" %>% assign(100,envir=env)
x
[1] 100
rnorm(100) %>%
matrix(ncol=2) %>%
plot() %>%
str()
NULL

rnorm(100) %>%
matrix(ncol=2) %>%
plot() %>%
str()
NULL

ndist <- rnorm(100000)
hist(ndist)

Calculate the correlation between two variables in mtcars dataset
mtcars %$%
cor(disp, mpg)
[1] -0.8475514
- For assignment magrittr provides the
%<>%
operator which allows you to replace code like:
mtcars <- mtcars %>%
transform(cyl=cyl*2)
mtcars %<>% transform(cyl=cyl*2)
Chapter 19 Functions
19.1 Introduction
19.2 When should you write a function?
df <- tibble::tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df
df$a <- (df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) /
(max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) /
(max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) /
(max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
Rescale a single variable in a data frame
(df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
[1] 0.5235603 0.5627738 0.2777349 0.0000000 1.0000000 0.1019340 0.3832672 0.9030480 0.6526363 0.6825633
Rescale a single variable without creating a new object
x <- df$a
(x - min(x, na.rm = T)) / (max(x, na.rm = T)-min(x, na.rm = T))
[1] 0.5235603 0.5627738 0.2777349 0.0000000 1.0000000 0.1019340 0.3832672 0.9030480 0.6526363 0.6825633
Task: There is some duplication in this code. We’re computing the
range of the data three times, so it makes sense to do it in one
step:
rng <- range(x, na.rm = T)
(x-rng[1])/(rng[2]-rng[1])
[1] 0.5235603 0.5627738 0.2777349 0.0000000 1.0000000 0.1019340 0.3832672 0.9030480 0.6526363 0.6825633
Pulling out intermediate calculations into named variables is a good
practice because it makes it more clear what the code is doing. Now that
I’ve simplified the code, and checked that it still works, I can turn it
into a function:
rescale01 <- function(x){
rng <- range(x, na.rm = T)
(x-rng[1])/(rng[2]-rng[1])
}
rescale01(c(0,5,10))
[1] 0.0 0.5 1.0
Test the rescale01 function with various inputs
rescale01(c(-10,0,10))
[1] 0.0 0.5 1.0
rescale01(c(1,2,3,NA,5))
[1] 0.00 0.25 0.50 NA 1.00
We can simplify the original example now that we have a function:
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
Rescale a vector with infinite values
x <- c(1:10,Inf)
rescale01(x)
[1] 0 0 0 0 0 0 0 0 0 0 NaN
Because we’ve extracted the code into a function, we only need to
make the fix in one place:
rescale01 <- function(x){
rng <- range(x,na.rm=T,finite=T)
(x-rng[1])/(rng[2]-rng[1])
}
rescale01(x)
[1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667 0.7777778 0.8888889 1.0000000
[11] Inf
19.4 Conditional execution
An if statement allows you to conditionally execute
code. It looks like this:
# if (condition) {
# code executed when condition is TRUE
# } else {
# code executed when condition is FALSE
# }
Define a function to check if an object has names
has_name <- function(x){
nms <- names(x)
if(is.null(nms)){
rep(FALSE,length(x))
}else {
!is.na(nms) & nms !=""
}
}
19.4.1 Conditions
how if condition works with warnings
# if (c(TRUE,FALSE)){}
#> Warning in if (c(TRUE, FALSE)) {: the condition has length > 1 and only the
#> first element will be used
#> NULL
# if (NA) {}
Check if two objects are identical
identical(0L,0)
[1] FALSE
x <- sqrt(2)^2
x==2
[1] FALSE
x-2
[1] 4.440892e-16
19.4.2 Multiple conditions
You can chain multiple if statement together:
# if (this) {
# # do that
# } else if (that) {
# # do something else
# } else {
# #
# }
#> function(x, y, op) {
#> switch(op,
#> plus = x + y,
#> minus = x - y,
#> times = x * y,
#> divide = x / y,
#> stop("Unknown op!")
#> )
#> }
19.4.3 Code style
Good practice for writing if statements
# Good
# if (y < 0 && debug) {
# message("Y is negative")
# }
#
# if (y == 0) {
# log(x)
# } else {
# y ^ x
# }
#
# # Bad
# if (y < 0 && debug)
# message("Y is negative")
#
# if (y == 0) {
# log(x)
# }
# else {
# y ^ x
# }
It’s ok to drop the curly braces if you have a very short if
statement that can fit on one line:
y <- 10
x <- if (y < 20) "Too low" else "Too high"
I recommend this only for very brief if statements.
Otherwise, the full form is easier to read:
if (y < 20) {
x <- "Too low"
} else {
x <- "Too high"
}
19.5 Function arguments
# Compute confidence interval around mean using normal approximation
mean_ci <- function(x, conf = 0.95) {
se <- sd(x) / sqrt(length(x))
alpha <- 1 - conf
mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}
x <- runif(100)
mean_ci(x)
[1] 0.4370008 0.5485763
mean_ci(x, conf = 0.99)
[1] 0.4194710 0.5661061
19.5.1 Choosing names
19.5.2 Cheking values
wt_mean <- function(x, w) {
sum(x * w) / sum(w)
}
wt_var <- function(x, w) {
mu <- wt_mean(x, w)
sum(w * (x - mu) ^ 2) / sum(w)
}
wt_sd <- function(x, w) {
sqrt(wt_var(x, w))
}
What happens if x and w are not the same length?
wt_mean(1:6, 1:3)
[1] 7.666667
In this case, because of R’s vector recycling rules, we don’t get an
error.
It’s good practice to check important preconditions, and throw an
error (with stop()), if they are not true:
wt_mean <- function(x, w) {
if (length(x) != length(w)) {
stop("`x` and `w` must be the same length", call. = FALSE)
}
sum(w * x) / sum(w)
}
wt_mean <- function(x, w, na.rm = FALSE) {
if (!is.logical(na.rm)) {
stop("`na.rm` must be logical")
}
if (length(na.rm) != 1) {
stop("`na.rm` must be length 1")
}
if (length(x) != length(w)) {
stop("`x` and `w` must be the same length", call. = FALSE)
}
if (na.rm) {
miss <- is.na(x) | is.na(w)
x <- x[!miss]
w <- w[!miss]
}
sum(w * x) / sum(w)
}
This is a lot of extra work for little additional gain. A useful
compromise is the built-in stopifnot(): it checks that each
argument is TRUE, and produces a generic error message if
not.
wt_mean <- function(x, w, na.rm = FALSE) {
stopifnot(is.logical(na.rm), length(na.rm) == 1)
stopifnot(length(x) == length(w))
if (na.rm) {
miss <- is.na(x) | is.na(w)
x <- x[!miss]
w <- w[!miss]
}
sum(w * x) / sum(w)
}
19.5.3 Dot-dot-dot(…)
Many functions in R take an arbitrary number of inputs:
sum(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
[1] 55
stringr::str_c("a", "b", "c", "d", "e", "f")
[1] "abcdef"
Define a function to concatenate strings with commas
commas <- function(...) stringr::str_c(..., collapse = ", ")
commas(letters[1:10])
[1] "a, b, c, d, e, f, g, h, i, j"
rule <- function(..., pad = "-") {
title <- paste0(...)
width <- getOption("width") - nchar(title) - 5
cat(title, " ", stringr::str_dup(pad, width), "\n", sep = "")
}
rule("Important output")
Important output ----------------------------------------------------------------------------------------
x <- c(1,2)
sum(x,na.rm=T)
[1] 3
Define a function ‘complicated_function’ with conditions to return 0
if ‘x’ or ‘y’ is empty
complicated_function <- function(x,y,z){
if (lenth(x)==0 || length(y)==0){
return(0)
}
}
Improve readability of if-else blocks by using early return for
simple cases
f <- function() {
if (x) {
# Do
# something
# that
# takes
# many
# lines
# to
# express
} else {
# return something short
}
}
But if the first block is very long, by the time you get to the else,
you’ve forgotten the condition. One way to rewrite it is to use an early
return for the simple case:
f <- function() {
if (!x) {
return(something_short)
}
# Do
# something
# that
# takes
# many
# lines
# to
# express
}
This tends to make the code easier to understand, because you don’t
need quite so much context to understand it.
19.6.2 Writing pipeable functions
Define a function to show the count of missing values in a data
frame
show_missing <- function(df){
n <- sum(is.na(df))
cat("Missing values:",n,"\n",sep="")
invisible(df)
}
If we call it interatively, the invisible() means that
the input df does not get printed out:
show_missing(mtcars)
Missing values:0
But it’s still there, it’s not just printed by default:
x <- show_missing(mtcars)
Missing values:0
class(x)
[1] "data.frame"
dim(x)
[1] 32 11
And we can still use it in a pipe:
library(magrittr)
library(tidyverse)
mtcars %>%
show_missing() %>%
mutate(mpg=ifelse(mpg<20,NA,mpg)) %>%
show_missing()
Missing values:0
Missing values:18
19.7 Environment
Define a function ‘f’ that takes an argument ‘x’ and returns the sum
of ‘x’ and ‘y’
f <- function(x){
x+y
}
Demonstrate how changing the value of ‘y’ affects the result of
calling function ‘f’
y <- 100
f(10)
[1] 110
y <- 1000
f(10)
[1] 1010
Overload the ‘+’ operator to behave differently based on a random
condition
`+` <- function(x, y) {
if (runif(1) < 0.1) {
sum(x, y)
} else {
sum(x, y) * 1.1
}
}
table(replicate(1000, 1 + 2))
3 3.3
94 906
#>
#> 3 3.3
#> 100 900
rm(`+`)
Chapter 20: Vectors
20.1.1 PRerequisites
library(tidyverse)
20.2 Vector basics
Determine the data type of different vectors
typeof(letters)
[1] "character"
typeof(1:10)
[1] "integer"
Determine the length of a list and display its contents
x <- list("a","b",1:10)
length(x)
[1] 3
x
[[1]]
[1] "a"
[[2]]
[1] "b"
[[3]]
[1] 1 2 3 4 5 6 7 8 9 10
Demonstrate modulo operation and creation of logical vectors
1:10 %% 3 ==0
[1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
c(T,T,F,NA)
[1] TRUE TRUE FALSE NA
20.3.2 Numeric
Integer and double vectors are known collectively as numeric vectors.
In R, numbers are doubles by default. To make an integer, place an L
after the number:
typeof(1)
[1] "double"
typeof(1L)
[1] "integer"
1.5
[1] 1.5
Demonstrate the behavior of floating point arithmetic
x <- sqrt(2)^2
x
[1] 2
x-2
[1] 4.440892e-16
Demonstrate the behavior of division by zero
c(-1,0,1)%/% 0
[1] -Inf NaN Inf
# [1] -Inf NaN Inf
20.3.3 Character
Determine the memory size of a string and a replicated string
vector
x <- "This is a reasonably long string."
pryr::object_size(x)
152 B
y <- rep(x,1000)
pryr::object_size(y)
8.14 kB
20.3.4 Missing values
Note that each type of atomic vector has its own missing value:
NA # logical
[1] NA
NA_integer_ # integer
[1] NA
NA_real_ # double
[1] NA
NA_character_ # character
[1] NA
Calculate the number and proportion of elements in a vector greater
than 10
x <- sample(20,100,replace=T)
y <- x > 10
sum(y) # how many are greater than 10?
[1] 49
mean(y) # what proportion are greater than 10?
[1] 0.49
if (length(x)){
}
NULL
Determine the data type of different vectors
typeof(c(TRUE,1L))
[1] "integer"
typeof(c(1L,1.5))
[1] "double"
typeof(c(1.5,"a"))
[1] "character"
Generate random numeric or logical vectors
sample(10)+100
[1] 110 103 104 101 102 107 109 106 105 108
runif(10)>0.5
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
Demonstrate vector arithmetic with vectors of different lengths
1:10 +1:2
[1] 2 4 4 6 6 8 8 10 10 12
1:10+1:3
Warning: longer object length is not a multiple of shorter object length
[1] 2 4 6 5 7 9 8 10 12 11
Create a tibble with two columns, ‘x’ and ‘y’, with different
lengths
library(tidyverse)
tibble(
x=1:4,
y=rep(1:2,each=2)
)
20.4.4 Naming vectors
All types of vectors can be named. You can name them during creatin
with c():
c(x=1,y=2,z=4)
x y z
1 2 4
Or after the fact with purr::set_names()
set_names(1:3,c("a","b","c"))
a b c
1 2 3
Named vectors are most useful for subsetting, described next.
20.4.5 Subsetting
Demonstrate subsetting vectors with integer indices
x <- c("one","two","three","four","five")
x[c(3,2,5)]
[1] "three" "two" "five"
By repeating a position, you can actually make a longer output than
input:
x[c(1,1,5,5,5,2)]
[1] "one" "one" "five" "five" "five" "two"
Negative values drop the elements at the specified positions:
x[c(-1,-3,-5)]
[1] "two" "four"
The error message mentions subsetting with zero, which returns no
values:
x[0]
character(0)
library(tidyverse)
x <- c(10,3,NA,5,8,1)
# tibble test
x <- as.tibble(x,ncol=1)
Warning: `as.tibble()` was deprecated in tibble 2.0.0.
Please use `as_tibble()` instead.
The signature and semantics have changed, see `?as_tibble`.
names(x)="v1"
is.na(x)
v1
[1,] FALSE
[2,] FALSE
[3,] TRUE
[4,] FALSE
[5,] FALSE
[6,] FALSE
x %>% filter(v1 == NA)
# all non-missing values of x
x <- c(10,3,NA,5,8,1)
x[!is.na(x)]
[1] 10 3 5 8 1
# all even (or missing) values of x
x[x %% 2==0]
[1] 10 NA 8
- If you have a named vector, you can subset it with a character
vector:
x <- c(abc=1, def=2,xyz=5)
x[c("xyz","def")]
xyz def
5 2
20.5 Recursive vectors (lists)
Create a list with numeric elements
x <- list(1,2,3)
x
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
Display the structure of lists with and without names
str(x)
List of 3
$ : num 1
$ : num 2
$ : num 3
x_named <- list(a=1,b=2,c=3)
str(x_named)
List of 3
$ a: num 1
$ b: num 2
$ c: num 3
Unlike atomic vectors, list() can contain a mix of
objects:
y <- list("a",1L,1.5,T)
str(y)
List of 4
$ : chr "a"
$ : int 1
$ : num 1.5
$ : logi TRUE
List can even contain other lists!
z <- list(list(1,2),list(3,4))
str(z)
List of 2
$ :List of 2
..$ : num 1
..$ : num 2
$ :List of 2
..$ : num 3
..$ : num 4
20.5.1 Visualizing lists
x1 <- list(c(1,2),c(3,4))
x2 <- list(list(1,2),list(3,4))
x3 <- list(1,list(2,list(3)))
x1
[[1]]
[1] 1 2
[[2]]
[1] 3 4
x2
[[1]]
[[1]][[1]]
[1] 1
[[1]][[2]]
[1] 2
[[2]]
[[2]][[1]]
[1] 3
[[2]][[2]]
[1] 4
x3
[[1]]
[1] 1
[[2]]
[[2]][[1]]
[1] 2
[[2]][[2]]
[[2]][[2]][[1]]
[1] 3
20.5.2 Subsetting
Create a list ‘a’ with named elements and demonstrate subsetting
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
str(a)
List of 4
$ a: int [1:3] 1 2 3
$ b: chr "a string"
$ c: num 3.14
$ d:List of 2
..$ : num -1
..$ : num -5
str(a[1:2])
List of 2
$ a: int [1:3] 1 2 3
$ b: chr "a string"
str(a[4])
List of 1
$ d:List of 2
..$ : num -1
..$ : num -5
Demonstrate subsetting lists using double square brackets
str(a[[1]])
int [1:3] 1 2 3
str(a[[4]])
List of 2
$ : num -1
$ : num -5
Access list elements by name using $ or [[ ]]
a$a
[1] 1 2 3
a[["a"]]
[1] 1 2 3
20.6 Attributes
Demonstrate setting and retrieving attributes of vectors
x <- 1:10
attr(x,"greeting")
NULL
attr(x,"greeting") <- "Hi!"
attr(x,"farewell") <- "Bye!"
attributes(x)
$greeting
[1] "Hi!"
$farewell
[1] "Bye!"
Demonstrate methods for class ‘Date’
as.Date
function (x, ...)
UseMethod("as.Date")
<bytecode: 0x0000022ea4225378>
<environment: namespace:base>
methods("as.Date")
[1] as.Date.character as.Date.default as.Date.factor as.Date.numeric as.Date.POSIXct
[6] as.Date.POSIXlt as.Date.vctrs_sclr* as.Date.vctrs_vctr*
see '?methods' for accessing help and source code
Retrieve specific methods for ‘as.Date’
getS3method("as.Date","default")
function (x, ...)
{
if (inherits(x, "Date"))
x
else if (is.null(x))
.Date(numeric())
else if (is.logical(x) && all(is.na(x)))
.Date(as.numeric(x))
else stop(gettextf("do not know how to convert '%s' to class %s",
deparse1(substitute(x)), dQuote("Date")), domain = NA)
}
<bytecode: 0x0000022eb85b7c20>
<environment: namespace:base>
getS3method("as.Date","numeric")
function (x, origin, ...)
if (missing(origin)) .Date(x) else as.Date(origin, ...) + x
<bytecode: 0x0000022ec2ac86e0>
<environment: namespace:base>
20.7.1 Factors
Demonstrate creating a factor and inspecting its attributes
x <- factor(c("ab","cd","ab"),levels=c("ab","cd","ed"))
typeof(x)
[1] "integer"
attributes(x)
$levels
[1] "ab" "cd" "ed"
$class
[1] "factor"
20.7.2 Dates and date-times
Dates in R are numeric vectors that represent the number of days
since 1 January 1970.
x <- as.Date("1971-01-01")
unclass(x)
[1] 365
typeof(x)
[1] "double"
attributes(x)
$class
[1] "Date"
Demonstrate creating and inspecting a date-time object
x <- lubridate::ymd_hm("1970-01-01 01:00")
unclass(x)
[1] 3600
attr(,"tzone")
[1] "UTC"
typeof(x)
[1] "double"
attributes(x)
$class
[1] "POSIXct" "POSIXt"
$tzone
[1] "UTC"
Demonstrate setting and retrieving time zone for date-time object
attr(x,"tzone") <- "US/Pacific"
x
[1] "1969-12-31 17:00:00 PST"
attr(x,"tzone") <- "US/Eastern"
x
[1] "1969-12-31 20:00:00 EST"
There is another type of date-times called POSIXIt. There are built
on top of named lists:
y <- as.POSIXlt(x)
typeof(y)
[1] "list"
#> [1] "list"
attributes(y)
$names
[1] "sec" "min" "hour" "mday" "mon" "year" "wday" "yday" "isdst" "zone" "gmtoff"
$class
[1] "POSIXlt" "POSIXt"
$tzone
[1] "US/Eastern" "EST" "EDT"
$balanced
[1] TRUE
20.7.3 Tibbles
Tibbles are augmented lists: they have class “tbl_df” + “tbl” +
“data.frame”, and names (column) and row.names
attributes:
tb <- tibble::tibble(x = 1:5, y = 5:1)
typeof(tb)
[1] "list"
attributes(tb)
$class
[1] "tbl_df" "tbl" "data.frame"
$row.names
[1] 1 2 3 4 5
$names
[1] "x" "y"
df <- data.frame(x = 1:5, y = 5:1)
typeof(df)
[1] "list"
attributes(df)
$names
[1] "x" "y"
$class
[1] "data.frame"
$row.names
[1] 1 2 3 4 5
Chapter 21: Iteration
21.1.1 Prerequisites
library(tidyverse)
21.2 For loops
Imagine we have this simple tibble:
df <- tibble(
a=rnorm(10),
b=rnorm(10),
c=rnorm(10),
d=rnorm(10)
)
Calculate the median for each column in a tibble
median(df$a)
[1] -0.3157254
median(df$b)
[1] -0.8006407
median(df$c)
[1] -0.2668019
median(df$d)
[1] -0.02814063
Calculate the median for each column in the data frame ‘df’ using a
for loop
df
output <- vector("double",ncol(df))
for (i in seq_along(df)){
output[[i]] <- median(df[[i]])
}
output <- tibble(output)
Demonstrate the behavior of seq_along and length functions with an
empty vector ‘y’
y <- vector("double", 0)
seq_along(y)
integer(0)
#> integer(0)
1:length(y)
[1] 1 0
#> [1] 1 0
21.3.1v Modifying an existing object
Sometimes, you want to use a for loop to modify an existing object.
For example, remember our challenges from functions. We wanted to
rescale every column in a data frame:
library(tidyverse)
df <- tibble(
a=rnorm(10),
b=rnorm(10),
c=rnorm(10),
d=rnorm(10)
)
rescale01 <- function(x){
rng <- range(x,na.rm=T)
(x-rng[1])/(rng[2]-rng[1])
}
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
df
for ( i in seq_along(df)){
df[[i]] <- rescale01(df[[i]])
}
21.3.2 Looping patterns
x
[1] "1969-12-31 20:00:00 EST"
results <- vector("list",length(x))
names(results) <- names(x)
Demonstrate looping patterns using a for loop to iterate over a list
‘x’ and store results in a list ‘results’
for(i in seq_along(x)){
name <- names(x)[[i]]
value <- x[[i]]
}
21.3.3 Unknown output length
Create a vector ‘output’ with unknown length and store results from a
for loop in it
means <- c(0,1,2)
output <- double()
for (i in seq_along(means)){
n <- sample(100,1)
output <- c(output,rnorm(n,means[[i]]))
}
str(output)
num [1:223] 2.4083 1.5499 0.6081 0.0844 0.7443 ...
output
[1] 2.40834490 1.54985893 0.60813582 0.08444820 0.74433326 0.23589873 0.13677913 -1.51138770
[9] -1.27301392 -1.76413099 -0.63497070 1.54956856 1.50375944 0.71571312 0.57801330 0.33952611
[17] 0.55112157 0.17114550 -0.45463725 1.16554397 0.69994812 -1.38517572 -0.21089332 0.59729886
[25] 0.96649672 0.27565281 2.07450311 0.94767106 -1.19450592 -1.17615918 -0.06135937 0.31565475
[33] 0.46863199 -2.44533524 1.06774440 -0.53263928 0.79354070 -1.03657232 -1.41232073 -1.32268012
[41] -0.80619868 -1.48689463 -1.54482571 1.03872808 -1.69903338 -1.03393281 1.63922764 1.21681751
[49] -1.50423215 0.08619177 -0.64176595 -0.43528690 2.35908464 -0.07057289 0.79716367 1.39285456
[57] 1.99268652 1.96561675 4.02439840 2.43533373 1.71102597 -1.18110259 0.59191107 1.10348971
[65] 0.73300412 0.12159976 0.27290089 1.61147640 0.05298932 0.28489074 0.51530326 2.48330465
[73] 1.37412695 1.25174310 1.51181660 0.51833391 1.22731542 0.94751275 2.14534938 3.73603533
[81] 2.50448575 -0.07692596 0.38360741 1.13672743 1.59343745 1.59795077 1.43992598 1.32878827
[89] 4.19537063 1.71172058 1.10573970 0.60868714 1.07205788 2.30309804 0.89510610 1.53102920
[97] 1.34831205 1.42881339 1.98257191 1.81541903 1.01326728 0.19392951 0.33627612 0.47158036
[105] 2.83906013 0.38940341 1.87692593 0.86561042 1.79769711 1.32415638 1.66865453 2.99697358
[113] 0.46228139 0.33591193 0.24631821 -0.43897896 -0.11946077 1.11951899 2.53562810 1.15236359
[121] -0.09132156 -1.16767499 0.69617045 1.53977593 -0.25021887 2.18171198 1.97165693 -0.23020504
[129] -0.44086089 1.15684137 2.24634334 1.55135055 2.31883196 0.67556263 0.08018167 1.13714558
[137] 1.91669266 1.11139134 2.44770118 0.19471109 2.17871196 4.23369053 -0.15692858 2.21704642
[145] 2.90250569 -0.47232133 1.53732998 0.24186549 2.47811777 3.33046579 2.23734182 0.96433477
[153] 2.53867102 2.31206058 1.97274207 2.13828221 2.57547609 1.13761332 1.18604257 1.61607018
[161] 2.37121047 2.84795864 3.27669950 1.60560488 2.10355804 3.13176141 3.26092255 0.75913241
[169] 0.30603003 2.52789744 0.60599780 2.73159671 0.64479899 -0.11114226 0.61712535 3.05991705
[177] 1.35211904 1.02472332 1.29546010 2.25444890 1.57803759 0.76376634 -0.73903589 2.93147906
[185] 2.07135777 1.17651876 1.52535341 1.73003413 0.57964605 1.54963744 2.16950772 1.45261534
[193] 2.78626480 2.75094794 2.99609585 2.54613765 1.71969902 4.22074008 2.97537992 1.47012235
[201] 1.27664953 2.99738151 2.29757845 1.24353502 2.34876874 0.40451196 0.98056608 2.15815198
[209] 1.21343698 2.37080467 1.56698860 3.19657275 0.47615986 0.53829438 3.06730806 1.80701539
[217] 2.43974635 1.12710760 2.72351352 2.57993773 3.26991511 1.74089217 0.42477812
Create a list ‘out’ with unknown length and store results from a for
loop in it
out <- vector("list",length(means))
for (i in seq_along(means)){
n <- sample(100,1)
out[[i]] <- rnorm(n,means[[i]])
}
str(out)
List of 3
$ : num [1:23] 0.109 0.669 -0.159 -0.325 -0.81 ...
$ : num [1:97] 1.57 2.15 2.41 1.75 1.1 ...
$ : num [1:9] 1.86 2.62 2.25 1.62 2.24 ...
str(unlist(out))
num [1:129] 0.109 0.669 -0.159 -0.325 -0.81 ...
21.3.4 Unknown sequence length
A while loop is also more general than a for loop, because you can
rewrite any for loop as a while loop, but you can’t rewrite every while
loop as for loop:
for (i in seq_along(x)) {
# body
}
# Equivalent to
i <- 1
while (i <= length(x)) {
# body
i <- i + 1
}
Herhow we could use a while loop to find how many tries it takes to
get three heads in a row:
flip <- function() sample(c("T", "H"), 1)
flips <- 0
nheads <- 0
while (nheads < 3) {
if (flip() == "H") {
nheads <- nheads + 1
} else {
nheads <- 0
}
flips <- flips + 1
}
flips
[1] 26
21.4 For loops vs. functionals
Compare for loop and functional approaches for calculating column
means in a data frame
df <- tibble(
a=rnorm(10),
b=rnorm(10),
c=rnorm(10),
d=rnorm(10)
)
Using for loop
output <- vector("double",length(df))
for (i in seq_along(df)){
output[[i]] <- mean(df[[i]])
}
output
[1] 0.476473102 0.001854536 0.558698854 0.220409290
Using functional approach with a custom function ‘col_mean’
col_mean <- function(df){
output <- vector("double",length(df))
for (i in seq_along(df)){
output[i] <- mean(df[[i]])
}
output
}
Define a function ‘col_median’ to calculate the median for each
column in the data frame ‘df’
col_median <- function(df){
output <- vector("double",hh(df))
for (i in seq_along(df)){
output[i] <- median(df[[i]])
}
output
}
col_sd <- function(df){
output <- vector("double",length(df))
for (i in seq_along(df)){
output[i] <- sd(df[[i]])
}
output
}
df
Define functions f1, f2, and f3 for calculating different powers of
absolute deviation from the mean
f1 <- function(x) abs(x-mean(x))^1
f2 <- function(x) abs(x-mean(x))^2
f3 <- function(x) abs(x-mean(x))^3
Define a function ‘f’ to calculate the absolute deviation from the
mean raised to a given power ‘i’
f <- function(x,i) abs(x-mean(x))^i
Define a function ‘col_summary’ to apply a summary function ‘fun’ to
each column of the data frame ‘df’
col_summary <- function(df, fun) {
out <- vector("double", length(df))
for (i in seq_along(df)) {
out[i] <- fun(df[[i]])
}
out
}
col_summary(df, median)
[1] 0.47327175 -0.06728873 0.35193999 0.23340748
col_summary(df, mean)
[1] 0.476473102 0.001854536 0.558698854 0.220409290
Demonstrate the use of ‘map_dbl’ from the ‘purrr’ package to apply a
function to each column of the data frame ‘df’
library(purrr)
head(df)
# Reference - for loop()
output <- vector("double",length(df))
for (i in seq_along(df)){
output[[i]] <- mean(df[[i]])
}
output
[1] 0.476473102 0.001854536 0.558698854 0.220409290
map_dbl(df,mean)
a b c d
0.476473102 0.001854536 0.558698854 0.220409290
map_dbl(df,median)
a b c d
0.47327175 -0.06728873 0.35193999 0.23340748
map_dbl(df,sd)
a b c d
1.1903576 0.9296956 0.7214505 0.6619177
df %>% map_dbl(mean)
a b c d
0.476473102 0.001854536 0.558698854 0.220409290
df %>% map_dbl(median)
a b c d
0.47327175 -0.06728873 0.35193999 0.23340748
df %>% map_dbl(sd)
a b c d
1.1903576 0.9296956 0.7214505 0.6619177
Demonstrate the use of ‘map_dbl’ from the ‘purrr’ package with
additional arguments
map_dbl(df,mean,trim=0.5)
a b c d
0.47327175 -0.06728873 0.35193999 0.23340748
Demonstrate the use of ‘map_int’ from the ‘purrr’ package to apply a
function that returns integers to each element of a list
z <- list(x=1:3,y=4:5)
z
$x
[1] 1 2 3
$y
[1] 4 5
map_int(z,length)
x y
3 2
21.5.1 Shortcuts
Demonstrate the use of ‘safely’ from the ‘purrr’ package to create a
safe version of a function
safe_log <- safely(log)
str(safe_log(10))
List of 2
$ result: num 2.3
$ error : NULL
str(safe_log("a"))
List of 2
$ result: NULL
$ error :List of 2
..$ message: chr "non-numeric argument to mathematical function"
..$ call : language .Primitive("log")(x, base)
..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
Demonstrate the use of ‘map’ from the ‘purrr’ package with ‘safely’
to apply a safe version of a function to each element of a list
x <- list(1,10,"a")
y <- x %>% map(safely(log))
str(y)
List of 3
$ :List of 2
..$ result: num 0
..$ error : NULL
$ :List of 2
..$ result: num 2.3
..$ error : NULL
$ :List of 2
..$ result: NULL
..$ error :List of 2
.. ..$ message: chr "non-numeric argument to mathematical function"
.. ..$ call : language .Primitive("log")(x, base)
.. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
Demonstrate the use of ‘transpose’ from the ‘purrr’ package to
transpose a list of lists
y <- x %>% transpose()
str(y)
List of 1
$ :List of 3
..$ : num 1
..$ : num 10
..$ : chr "a"
Demonstrate the use of error handling with ‘map_lgl’ and ‘is_null’
from the ‘purrr’ package
is_ok <- y$error %>% map_lgl(is_null)
x[!is_ok]
list()
# y$result[is_ok] %>% flatten_dbl()
Purrr provides two other useful adverbs:
x <- list(1,10,"a")
x %>% map_dbl(possibly(log,NA_real_))
[1] 0.000000 2.302585 NA
Demonstrate the use of ‘quietly’ from the ‘purrr’ package to suppress
errors and return results with warnings
x <- list(1,-1)
x %>% map(quietly(log)) %>% str()
List of 2
$ :List of 4
..$ result : num 0
..$ output : chr ""
..$ warnings: chr(0)
..$ messages: chr(0)
$ :List of 4
..$ result : num NaN
..$ output : chr ""
..$ warnings: chr "NaNs produced"
..$ messages: chr(0)
21.7 Mapping over multiple arguments
Generate random numbers from normal distributions with different
means using ‘map’ from the ‘purrr’ package
mu <- list(5,10,-3)
mu %>%
map(rnorm,n=5) %>%
str()
List of 3
$ : num [1:5] 5.11 4.32 5.23 5.09 4.36
$ : num [1:5] 9.66 9.56 12.07 9.6 7.39
$ : num [1:5] -2.57 -3.32 -4.2 -2.16 -2.32
Generate random numbers from normal distributions with different
means and standard deviations using ‘map2’ from the ‘purrr’ package
sigma <- list(1,5,10)
seq_along(mu) %>%
map(~rnorm(5,mu[[.]],sigma[[.]])) %>%
str()
List of 3
$ : num [1:5] 5.36 7.16 4.32 4.53 5.97
$ : num [1:5] 21.09 8.26 15.76 8.76 8.33
$ : num [1:5] -11.56 -11 -4.91 -10.91 -13.74
Define a custom ‘map2’ function to apply a binary function to
corresponding elements of two lists
map2(mu,sigma,rnorm,n=5) %>% str()
List of 3
$ : num [1:5] 4.85 4.27 4.85 3.68 3.97
$ : num [1:5] 15.4 16.3 13.5 11.2 11.4
$ : num [1:5] 4.561 -0.279 -11.559 0.298 -5.152
map2 <- function(x,y,f,...){
out <- vector("list",length(x))
for (i in seq_along(x)){
out[[i]] <- f(x[[i]],y[[i]],...)
}
out
}
Apply a function to corresponding elements of multiple lists using
‘pmap’ from the ‘purrr’ package
library(magrittr)
library(purrr)
n <- list(1,3,5)
args1 <- list(n,mu,sigma)
args1 %>%
pmap(rnorm) %>%
str()
List of 3
$ : num 3.71
$ : num [1:3] 6.11 12.73 9.31
$ : num [1:5] -12.562 -1.359 -0.823 -20.731 -12.676
Apply a function to corresponding elements of multiple lists with
named parameters using ‘pmap’ from the ‘purrr’ package
args2 <- list(mean=mu, sd=sigma,n=n)
args2 %>%
pmap(rnorm) %>%
str()
List of 3
$ : num 4.34
$ : num [1:3] 9.78 8.19 11.21
$ : num [1:5] 0.742 6.791 14.991 -1.91 0.511
Apply a function to corresponding rows of a data frame using ‘pmap’
from the ‘purrr’ package with a tibble
library(tidyverse)
parms <- tribble(
~mean,~sd,~n,
5,1,1,
10,5,3,
-3,10,5
)
parms %>%
pmap(rnorm)
[[1]]
[1] 3.243648
[[2]]
[1] 9.605242 11.025883 6.007252
[[3]]
[1] 16.669768 5.855557 -12.841618 -3.360844 5.139707
21.7.1 Involing different functions
Invoke different functions with different parameters using
‘invoke_map’ from the ‘purrr’ package
f <- c("runif","rnorm","rpois")
param <- list(
list(min=-1,max=1),
list(sd=5),
list(lambda=10)
)
f
[1] "runif" "rnorm" "rpois"
param
[[1]]
[[1]]$min
[1] -1
[[1]]$max
[1] 1
[[2]]
[[2]]$sd
[1] 5
[[3]]
[[3]]$lambda
[1] 10
To handle this case, you can use invoke_map():
invoke_map(f,param,n=5) %>%
str()
Warning: `invoke_map()` was deprecated in purrr 1.0.0.
Please use map() + exec() instead.
List of 3
$ : num [1:5] 0.7386 -0.9649 -0.0834 -0.4986 -0.9628
$ : num [1:5] 6.447 -6.531 -0.428 6.091 8.155
$ : int [1:5] 9 11 9 9 11
Invoke different functions with different parameters using ‘pmap’
from the ‘purrr’ package and a tibble
sim <- tribble(
~f, ~params,
"runif", list(min = -1, max = 1),
"rnorm", list(sd = 5),
"rpois", list(lambda = 10)
)
sim %>%
mutate(sim = invoke_map(f, params, n = 10))
21.8 Walk
Perform side effects without returning a value for each element of a
list using ‘walk’ from the ‘purrr’ package
x <- list(1,"a",3)
x %>%
walk(print)
[1] 1
[1] "a"
[1] 3
Perform side effects on each element of a list using ‘walk’ from the
‘purrr’ package, then save the results
library(ggplot2)
plots <- mtcars %>%
split(.$cyl) %>%
map(~ggplot(., aes(mpg, wt)) + geom_point())
paths <- stringr::str_c(names(plots), ".pdf")
pwalk(list(paths, plots), ggsave, path = tempdir())
Saving 7 x 7 in image
Retain or remove elements of a list based on a predicate function
using ‘keep’ and ‘discard’ from the ‘purrr’ package
iris %>%
keep(is.factor) %>%
str()
'data.frame': 150 obs. of 1 variable:
$ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
iris %>%
discard(is.factor) %>%
str()
'data.frame': 150 obs. of 4 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
library(tidyverse)
library(magrittr)
21.9.2 Reduce and accumulate
Iteratively combine elements of a list using a binary function with
‘reduce’ from the ‘purrr’ package
dfs <- list(
age=tibble(name="John",age=30),
sex=tibble(name=c("John","Mary"),sex=c("M","F")),
trt=tibble(name="Mary",treatment="A")
)
dfs %>% reduce(full_join)
Joining with `by = join_by(name)`Joining with `by = join_by(name)`
Find the intersection of multiple vectors using ‘reduce’ from the
‘purrr’ package
vs <- list(
c(1,3,5,6,10),
c(1,2,3,7,8,10),
c(1,2,3,4,8,9,10)
)
vs %>% reduce(intersect)
[1] 1 3 10
Iteratively apply a function to elements of a list using ‘accumulate’
from the ‘purrr’ package
x <- sample(10)
x
[1] 7 3 4 2 8 9 1 5 10 6
x %>% accumulate(`+`)
[1] 7 10 14 16 24 33 34 39 49 55
---
title: "Introduction to R"
author: "Bibek Sapkota"
output:
  pdf_document: default
  html_notebook: default
---

# Tibbles

Task 1:Loading the tidyverse package.
```{r}
library(tidyverse)
```
Task 2:Converting the iris dataset to a tibble.
```{r}
as_tibble(iris)
```
Task 3: Creating a tibble with columns "x," "y," and "z," where "x" ranges from 1 to 5, "y" is 1 for all rows, and "z" is calculated as the square of "x" plus "y" for each row.
```{r}
tibble(
  x = 1:5, 
  y = 1, 
  z = x ^ 2 + y
)
```

Task 4:Creating a tibble with columns named ":)" (representing "smile"), " " (representing "space"), and "2000" (representing "number").
```{r}
tb <- tibble(
  `:)` = "smile", 
  ` ` = "space",
  `2000` = "number"
)
tb
```

Task 5:Creating a tibble with columns "x," "y," and "z," containing the values "a," 2, 3.6 and "b," 1, 8.5 respectively.

```{r}
tribble(
  ~x, ~y, ~z,
  
  "a", 2, 3.6,
  "b", 1, 8.5
)
```

# Tibbles vs. data.frame

Task-1:Creating a tibble with columns "a," "b," "c," "d," and "e," containing 1000 randomly generated values for each column, representing dates, numbers, and letters.

```{r}
tibble(
  a = lubridate::now() + runif(1e3) * 86400,
  b = lubridate::today() + runif(1e3) * 30,
  c = 1:1e3,
  d = runif(1e3),
  e = sample(letters, 1e3, replace = TRUE)
)
```
Task 2: Tnstalling the package
```{r}
package_to_install <- c("nycflights13")

for (package_name in package_to_install) {
  if (!requireNamespace(package_name, quietly = TRUE)) {
    install.packages(package_name)
  }
}
library(nycflights13)
```

Task 3: Printing the first 10 rows of the nycflights13::flights dataset with unlimited width.
```{r}
nycflights13::flights %>% 
  print(n = 10, width = Inf)
```


Task 4: Viewing the nycflights13::flights dataset in a separate window for interactive exploration.
```{r}
nycflights13::flights %>% 
  View()
```

## Subsetting

Task 1: Creating a tibble named "df" with columns "x" and "y," then accessing the "x" column using different methods:
```{r}
df <- tibble(
  x = runif(5),#function that generates random numbers from a uniform distribution
  y = rnorm(5) # function that generates random numbers from a normal (Gaussian) distribution
)

df$x

df[["x"]]

df[[1]]

df %>% .$x


```
## Interacting with older code
Task-1: Determining the class of the object "tb" after converting it to a data frame.

```{r}
class(as.data.frame(tb))
```
## Exercises
Task-1: How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame).
```{r}
mtcars
```
Task-2
```{r}
# In a data.frame, extracting a non-existent column returns NULL,
# whereas in a tibble, it raises an error, providing immediate feedback.
# Other operations, such as extracting existing columns and subsets of columns,
# behave similarly across both data frames and tibbles.
# The default behavior of data.frames may lead to frustration
# due to the lack of error feedback for non-existent columns,
# potentially causing unnoticed mistakes and difficulty in debugging.
# In contrast, tibbles offer more robust behavior, enhancing data integrity
# and debugging efficiency.

df <- data.frame(abc = 1, xyz = "a")

# Extracting non-existent column in a data.frame
df$x  # Returns NULL

# Extracting existing column in a data.frame
df[, "xyz"]  # Returns a data frame with one column containing the values of the "xyz" column

# Extracting multiple columns in a data.frame
df[, c("abc", "xyz")]  # Returns a data frame containing only the specified columns

```
Task-3:If you have the name of a variable stored in an object, e.g. var <- "mpg", how can you extract the reference variable from a tibble?


# No pacakages
```{r}
# heights <- read_csv("data/heights.csv")
```

Task 1:  listing several tables: table1, table2, table3, table4a, and table4b.
```{r}
table1
table2
table3
table4a
table4b
```
Task 2: Calculating the rate by dividing the number of cases by the population and then multiplying by 10,000 for table1.
```{r}
table1 %>% 
  mutate(rate = cases / population * 10000)
```
Task 3: Counting the occurrences of each year in table1, using the 'cases' column as the weight.
```{r}
table1 %>% 
  count(year, wt = cases)
```
Task 4: Creating a ggplot using table1, plotting 'year' against 'cases' with lines grouped by 'country' and colored in grey50, along with points colored by 'country'.

```{r}
library(ggplot2)
ggplot(table1, aes(year, cases)) + 
  geom_line(aes(group = country), colour = "grey50") + 
  geom_point(aes(colour = country))
```
#  Pivoting
## Longer
Task-1: referring to 'table4a'
```{r}
table4a
```
Task-2: Reshaping table4a using pivot_longer for columns '1999' and '2000' into 'year' and 'cases'.
```{r}
table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
```
Task-3: Reshaping table4b with pivot_longer for columns '1999' and '2000' into 'year' and 'population'.
```{r}
table4b %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")  #function transforms wide data into long format by stacking multiple columns into two: one for variable names and one for their corresponding values
```
Task-4: creating tidy datasets tidy4a and tidy4b by using pivot_longer on table4a and table4b to reshape them. Then, performing a left join on tidy4a and tidy4b.
```{r}
tidy4a <- table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
left_join(tidy4a, tidy4b)
```
## Wider
Task-1:Displaying table 2
```{r}
table2
```

Task-2: using the pivot_wider function on table2 to transform it from long to wide format, with 'type' becoming the new column names and 'count' being the corresponding values.
```{r}
table2 %>%
    pivot_wider(names_from = type, values_from = count)
```
###########
# Separating and uniting
## Separate
Task-1:displaying table3
```{r}
 table3
```

Task-2: Using the separate function on table3 splits the 'rate' column into two separate columns named 'cases' and 'population'.
```{r}
table3 %>% 
  separate(rate, into = c("cases", "population"))
```
Task-3:Using the separate function on table3 splits the 'rate' column into two separate columns named 'cases' and 'population', using the '/' character as the separator.
```{r}
table3 %>% 
  separate(rate, into = c("cases", "population"), sep = "/")
```
Task-4:Using the separate function on table3 splits the 'rate' column into two separate columns named 'cases' and 'population', converting the resulting columns to their appropriate data types.
```{r}
table3 %>% 
  separate(rate, into = c("cases", "population"), convert = TRUE)
```
Task-5:  Applying the separate function to table3, the 'year' column is divided into two separate columns labeled 'century' and 'year', with the separator defined as the second character.
```{r}
table3 %>% 
  separate(year, into = c("century", "year"), sep = 2)
```
## Unite

Task-1: The unite function is applied to table5 to merge the 'century' and 'year' columns into a single column named 'new'.
```{r}
table5 %>% 
  unite(new, century, year)
```
Task-2: unite function is applied to table5 to merge the 'century' and 'year' columns into a single column named 'new', with no separator between them.
```{r}
table5 %>% 
  unite(new, century, year, sep = "")
```
#  Missing values
Task-1: Create a tibble named "stocks" with columns "year", "qtr" (quarter), and "return", having data for 2015 and 2016, with quarterly returns specified and some missing entries as NA.
```{r}
stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
```


Task-2:Pivoting the "stocks" tibble to widen the data, extracting columns from the "year" variable and values from the "return" variable.
```{r}
stocks %>% 
  pivot_wider(names_from = year, values_from = return)
```

Task-3: pivot the data to a wide format with columns for each year's returns, then reshape it back to a long format, keeping only the non-missing values in the "return" column.
```{r}
stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(
    cols = c(`2015`, `2016`), 
    names_to = "year", 
    values_to = "return", 
    values_drop_na = TRUE
  )
```
Task-4:Filling missing combinations of "year" and "qtr" in the "stocks" dataset.
```{r}
stocks %>% 
  complete(year, qtr)
```
Task-5:Creating a tibble named "treatment" containing information about individuals, their treatment groups, and their responses, with some missing values for the "person" column.
```{r}
treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)
```
Task-6: Filling the missing values in the "person" column of the "treatment" tibble.
```{r}
treatment %>% 
  fill(person)

```

# Case Study
Task-1: Loading data set
```{r}
who
```

Task-2:Pivoting the "who" dataset from wide to long format, condensing columns into "cases" and capturing the original column names in "key".
```{r}
who1 <- who %>% 
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  )
who1
```
Task-3:Counting the occurrences of each "key" in the "who1" dataset.
```{r}
  who1 %>% 
    count(key)
```
Task-4:Replacing "newrel" with "new_rel" in the "key" column of the "who1" dataset to create "who2."
```{r}
who2 <- who1 %>% 
  mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
who2
```
Task-5:Separating the "key" column in the "who2" dataset into "new," "type," and "sexage" columns using "_" as the separator to create "who3."
```{r}
who3 <- who2 %>% 
  separate(key, c("new", "type", "sexage"), sep = "_")
who3
```
Task-6:Counting the occurrences of each unique value in the "new" column of the "who3" dataset.
```{r}
who3 %>% 
  count(new)
```

Task-7:Removing the "new", "iso2", and "iso3" columns from the "who3" dataset and assigning the result to "who4".
```{r}
who4 <- who3 %>% 
  select(-new, -iso2, -iso3)
```

Task-8:Splitting the "sexage" column of the "who4" dataset into "sex" and "age" columns, separated by the first character, and assigning the result to "who5".
```{r}
who5 <- who4 %>% 
  separate(sexage, c("sex", "age"), sep = 1)
who5
```
Task-9:Transforming the "who" dataset from wide to long format, adjusting column names, extracting meaningful variables, dropping unnecessary columns, and splitting the "sexage" column into "sex" and "age".
```{r}
who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  ) %>% 
  mutate(
    key = stringr::str_replace(key, "newrel", "new_rel")
  ) %>%
  separate(key, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)

```


## CH-13: Relational data

Task-1:Loding the libraries
```{r}
library(tidyverse)
library(nycflights13)
```

## nycflights13
Task-1: airlines data
```{r}
airlines
```

Task-2: airports data
```{r}
airports
```
Task-3: planes data
```{r}
planes 
```
Task-4: weather data
```{r}
weather 
```
################################################################################

# Keys 
Task-1Counting the occurrences of each tail number in the "planes" table and filtering for those with more than one occurrence.
```{r}
planes %>% 
  count(tailnum) %>% 
  filter(n > 1)
```

Task-2:Counting the occurrences of each combination of year, month, day, hour, and origin in the "weather" table and filtering for those with more than one occurrence.
```{r}
weather %>% 
  count(year, month, day, hour, origin) %>% 
  filter(n > 1)
```
Task-3:Counting the occurrences of each combination of year, month, day, and flight in the "flights" table and filtering for those with more than one occurrence.
```{r}
flights %>% 
  count(year, month, day, flight) %>% 
  filter(n > 1)
```
Task-4:Counting the occurrences of each combination of year, month, day, and tail number in the "flights" table and filtering for those with more than one occurrence.
```{r}
flights %>% 
  count(year, month, day, tailnum) %>% 
  filter(n > 1)
```
# Mutating joins

Task-1: Creating a subset of the "flights" table named "flights2" containing columns from "year" to "day", "hour", "origin", "dest", "tailnum", and "carrier".
```{r}
flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)
flights2
```
Task-2:Removing the "origin" and "dest" columns from "flights2" table and then performing a left join with the "airlines" table, using the "carrier" column as the key for matching.
```{r}
flights2 %>%
  select(-origin, -dest) %>% 
  left_join(airlines, by = "carrier")
```
Task-3:Shortening the command by removing "selecting" and directly "mutating" the "name" column with the corresponding airline names from the "airlines" table based on the "carrier" column.
```{r}
flights2 %>%
  select(-origin, -dest) %>% 
  mutate(name = airlines$name[match(carrier, airlines$carrier)])
```
#  Understanding joins
Task-1:Creating two tibbles, "x" and "y", each with a "key" column and an associated "val_x" or "val_y" column, respectively.
```{r}
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

x
y
```

## Inner join
Task-1:Joining tibbles `x` and `y` using an inner join operation based on the "key" column.
```{r}
x %>% 
  inner_join(y, by = "key")
```
## Duplicate keys
Task-1: Joining tibble x with tibble y using the common column "key".
```{r}
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     1, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2"
)
```

Task-2:Performing a left join between tibble `x` and tibble `y` based on the common column "key".
```{r}
left_join(x, y, by = "key")
```

Task-3:Creating two tibbles, `x` and `y`, with columns "key", "val_x", and "val_y", populated with corresponding values.
```{r}
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     3, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     2, "y3",
     3, "y4"
)
```

Task-4:Performing a left join on tibbles `x` and `y` using the "key" column as the join key.
```{r}
left_join(x, y, by = "key")
```
# Defining the key columns
Task-1:Performing a left join between the `flights2` tibble and the `weather` tibble.
```{r}
flights2 %>% 
  left_join(weather)
```
Task-2:Performing a left join between the `flights2` tibble and the `planes` tibble using the "tailnum" column as the key.
```{r}
flights2 %>% 
  left_join(planes, by = "tailnum")
```
Task-3:Performing a left join between the `flights2` tibble and the `airports` tibble, matching the "dest" column from `flights2` with the "faa" column from `airports`.
```{r}
flights2 %>% 
  left_join(airports, c("dest" = "faa"))
```

Task-4:Performing a left join between the `flights2` tibble and the `airports` tibble, matching the "origin" column from `flights2` with the "faa" column from `airports`.
```{r}
flights2 %>% 
  left_join(airports, c("origin" = "faa"))
```
#  Filtering joins
Task-1: Calculating the top 10 destinations by counting the occurrences in the "dest" column of the `flights` tibble, sorted in descending order, and then displaying the result.
```{r}
top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(10)
top_dest
```
Task-2: Filtering the `flights` tibble to include only rows where the destination (`dest`) matches any of the top 10 destinations identified in the previous step.
```{r}
flights %>% 
  filter(dest %in% top_dest$dest)
#%in% operator in R is used to check if elements in one vector are present in another vector
```

Task-3: Selecting rows from the `flights` dataset where the destination airport matches one of the top 10 destinations previously identified.
```{r}
flights %>% 
  semi_join(top_dest)
```
Task-4: Filtering out flights with tail numbers present in the planes dataset and counting the occurrences of each unique tail number, sorting the result.
```{r}
flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(tailnum, sort = TRUE)
```
# Set operations
Task-1:creating two tibbles, df1 and df2, each with columns x and y, containing sample data.
```{r}
df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)
df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)
```

Task-2:performing set operations on the tibbles df1 and df2, including intersection, union, and set differences.
```{r}
intersect(df1, df2)
union(df1, df2)
setdiff(df1, df2)
setdiff(df2, df1)
```

# CH-14: Strings
Basic Info:string1 <- "This is a string"
           string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
           
Task-1:To include a literal single or double quote in a string you can use \ to “escape” it         
```{r}
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
```

Task-2: Understanding the character 
```{r}

x <- c("\"", "\\") #backslash is escape character
x
writeLines(x)
```
#  String length
Task-1:
```{r}
str_length(c("a", "R for data science", NA))
```

# Combining strings
Task-1:Combining  the strings
```{r}
str_c("x", "y")
str_c("x", "y", "z")
```
Task-2:Using the sep argument to control how they’re separated.
```{r}
str_c("x", "y", sep = ", ")
```
Task-3:Performing concatenation with "|" and "-" at both ends of each element of vector x, and replacing NA values with empty strings before concatenation.
```{r}
x <- c("abc", NA)
str_c("|-", x, "-|")
str_c("|-", str_replace_na(x), "-|")
```
Task-4: concatenating each element of the vector c("a", "b", "c") with a prefix "prefix-" and a suffix "-suffix".
```{r}
str_c("prefix-", c("a", "b", "c"), "-suffix")
```
Task-5: combining strings
```{r}
name <- "Hadley"
time_of_day <- "morning"
birthday <- FALSE

str_c(
  "Good ", time_of_day, " ", name,
  if (birthday) " and HAPPY BIRTHDAY",
  "."
)
```
# Subsetting strings
Task-1:Extracting the first three characters from each element in the vector `x` using `str_sub`.
```{r}
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
```
Task-2:negative numbers count backwards from end
```{r}
str_sub(x, -3, -1)
```
Task-3:using the assignment form of str_sub() to modify strings
```{r}
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
```

# Locales
Task-1:Changing the case 
```{r}
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
```
Task-2:Sorting the character vector x alphabetically using the English (en) locale and the Hawaiian (haw) locale.
```{r}
x <- c("apple", "eggplant", "banana")
str_sort(x, locale = "en") 
str_sort(x, locale = "haw") 
```
#  Matching patterns with regular expressions

## Basic matches
Task-1:Searching for the pattern "an" within each element of `x` and displaying the matches.
```{r}
x <- c("apple", "banana", "pear")
str_view(x, "an")
```

Task-2:Displaying elements in `x` where any character is followed by "a" and then any character.
```{r}
str_view(x, ".a.")
```
Task-3 
```{r}
# To create the regular expression, we need \\
dot <- "\\."

# But the expression itself only contains one:
writeLines(dot)

# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")

```
Task-4: Displaying elements in `x` where the sequence "\\" occurs.
```{r}
x <- "a\\b"
writeLines(x)

str_view(x, "\\\\")
```
##  Anchors
Task-1: Displaying elements in `x` that start with "a" and end with "a" respectively.
```{r}
x <- c("apple", "banana", "pear")
str_view(x, "^a")
str_view(x, "a$")
```
Task-2: Highlighting "apple" occurrences in `x` and instances where it's the only content.
```{r}
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
str_view(x, "^apple$")
```
## Character classes and alternatives

Task-1: Visualizing patterns matching "a.c", "a*c", and "a c" in the provided character vector.
```{r}
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")
```
Task-2: Visualizing patterns matching "grey" or "gray" in the provided character vector.
```{r}
str_view(c("grey", "gray"), "gr(e|a)y")
```
## Repetition
Task-1:Identifying patterns "CC" or "C" in the string "1888 is the longest year in Roman numerals
```{r}
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
```
Task-2: Viewing the pattern "CC"
```{r}
str_view(x, "CC+")
```
Task-3: Viewing the pattern "C[LX]+"
```{r}
str_view(x, 'C[LX]+')
```
Task-4:Viewing the pattern "C{2},C{2,},c{2,3}"
```{r}
str_view(x, "C{2}")
str_view(x, "C{2,}")
str_view(x, "C{2,3}")
```
## Grouping and backreferences
Task-1:Grouping
```{r}
str_view(fruit, "(..)\\1", match = TRUE)
```

## Detect matches
Task-1: Checking for the presence of the letter "e" in each word 
```{r}
x <- c("apple", "banana", "pear")
str_detect(x, "e")
```

Task-2:Checking how many common words start with t
```{r}
sum(str_detect(words, "^t"))
```

Task-3: Checking proportion of common words end with a vowel
```{r}
mean(str_detect(words, "[aeiou]$"))
```
Task-4:Finding all words containing at least one vowel, and negate
```{r}
no_vowels_1 <- !str_detect(words, "[aeiou]")
```

Task-5:Finding all words consisting only of consonants (non-vowels)
```{r}
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
identical(no_vowels_1, no_vowels_2)
```

Task-6: Filtering words that end with the letter "x" from a list of words.
```{r}
words[str_detect(words, "x$")]
str_subset(words, "x$")
```
Task-7: Filtering a tibble for words that end with "x".
```{r}
df <- tibble(
  word = words, 
  i = seq_along(word)
)
df %>% 
  filter(str_detect(word, "x$"))
```

Task-8:Counting the occurrences of "a" in each element of a character vector.
```{r}
x <- c("apple", "banana", "pear")
str_count(x, "a")
```
Task-9: Seeing average of how many vowels per word
```{r}
mean(str_count(words, "[aeiou]"))
```
Task-10: Adding columns to a tibble to count vowels and consonants in each word.
```{r}
df %>% 
  mutate(
    vowels = str_count(word, "[aeiou]"),
    consonants = str_count(word, "[^aeiou]")
  )
```

Task-11:Counting "aba" occurrences in "abababa" and showing all "aba" instances.
```{r}
str_count("abababa", "aba")
str_view_all("abababa", "aba")
```
## Extract matches
Task-1: Displaying the length of sentences and showing the first few sentences.
```{r}
length(sentences)
head(sentences)
```

Task-2: Creating a string pattern to match colors by concatenating them with a pipe delimiter.
```{r}
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match
```
Task-3: Filter sentences for colors and extract matches, showing the first few.
```{r}
has_colour <- str_subset(sentences, colour_match)
matches <- str_extract(has_colour, colour_match)
head(matches)
```
Task-4:Showing all sentences containing multiple colors and highlight the matches.
```{r}
more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match)
```
Task-5:Extracting all color matches from the subset of sentences containing multiple colors.
```{r}
str_extract(more, colour_match)
```

Task-6:Extracting all occurrences of colors from the subset of sentences containing multiple colors.
```{r}
str_extract_all(more, colour_match)
```
Task-7: Extracting colors from sentences with multiple colors and simplify, also extract lowercase letters from each element in x and simplify.
```{r}
str_extract_all(more, colour_match, simplify = TRUE)
x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
```
## Grouped matches 
Task-1: Extracting sentences containing nouns defined by a pattern, then extracts the nouns from those sentences.
```{r}
noun <- "(a|the) ([^ ]+)"

has_noun <- sentences %>%
  str_subset(noun) %>%
  head(10)
has_noun %>% 
  str_extract(noun)
```

Task-2:
```{r}
has_noun %>% 
  str_match(noun)
```
Task-3:Creating a tibble with columns 'article' and 'noun' extracted from sentences based on a pattern.
```{r}
tibble(sentence = sentences) %>% 
  tidyr::extract(
    sentence, c("article", "noun"), "(a|the) ([^ ]+)", 
    remove = FALSE
  )
```
## Replacing matches
Task-1: Replacing the first vowel in each word of x with a hyphen.
        Replacing all vowels in each word of x with a hyphen.
```{r}
x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
str_replace_all(x, "[aeiou]", "-")
```
Task-2: Replacing numeric values in x with their corresponding word representations.
```{r}
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
```
Task-3:Reordering words in sentences by swapping the second and third word positions.
```{r}
sentences %>% 
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% 
  head(5)
```
# Splitting
Task-1: Splitting the first five sentences into words.
```{r}
sentences %>%
  head(5) %>% 
  str_split(" ")
```
Task-2:Splitting the string 'a|b|c|d' by '|' into a vector of elements.
```{r}
"a|b|c|d" %>% 
  str_split("\\|") %>% 
  .[[1]]
```
Task-3:Splitting the first 5 sentences by space into a matrix of words.
```{r}
sentences %>%
  head(5) %>% 
  str_split(" ", simplify = TRUE)
```
Task-4:Splitting each field string into two parts at the first occurrence of ': '.
```{r}
fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
```
Task-5: Display word boundaries, split by spaces, and split by word boundaries, respectively.
```{r}
x <- "This is a sentence.  This is another sentence."
str_view_all(x, boundary("word"))
str_split(x, " ")[[1]]
str_split(x, boundary("word"))[[1]]
```
# Other types of pattern

Task-1: 
```{r}
# The regular call:
str_view(fruit, "nana")
# Is shorthand for
str_view(fruit, regex("nana"))
```
Task-2:Visualizing occurrences of "banana" in different case variations.
```{r}
bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")
str_view(bananas, regex("banana", ignore_case = TRUE))
```
Task-3: Extracting all lines starting with "Line" from the text.
```{r}
x <- "Line 1\nLine 2\nLine 3"
str_extract_all(x, "^Line")[[1]]
```
Task-4: Extracting all occurrences of lines starting with "Line" from the text, considering each line separately.
```{r}
str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
```
Task-5:Creating a regular expression pattern for phone numbers, allowing for variations in formatting, and attempting to match it against the provided phone number.
```{r}
phone <- regex("
  \\(?     # optional opening parens
  (\\d{3}) # area code
  [) -]?   # optional closing parens, space, or dash
  (\\d{3}) # another three numbers
  [ -]?    # optional space or dash
  (\\d{3}) # three more numbers
  ", comments = TRUE)

str_match("514-791-8141", phone)
```
Task-6:Installling the package and Benchmarking string detection in "sentences" using fixed and regex patterns 20 times each, comparing performance with microbenchmark.
```{r}

package_to_install <- c("microbenchmark")

for (package_name in package_to_install) {
  if (!requireNamespace(package_name, quietly = TRUE)) {
    install.packages(package_name)
  }
}
library(microbenchmark)

microbenchmark::microbenchmark(
  fixed = str_detect(sentences, fixed("the")),
  regex = str_detect(sentences, "the"),
  times = 20
  )
```
Task-7:Starting with a1 being "\u00e1" and a2 being "a\u0301", both representing the character "á", they are compared for equality.
```{r}
a1 <- "\u00e1"
a2 <- "a\u0301"
c(a1, a2)
a1 == a2
```
Task-8: Checking if `a1` contains the fixed string `a2` returns `FALSE`, whereas using collation rules returns `TRUE`.
```{r}
str_detect(a1, fixed(a2))

str_detect(a1, coll(a2))
```
Task-9:Creating a vector `i` with different forms of the letter "i", then using `str_subset` to filter them based on collation.
```{r}
i <- c("I", "İ", "i", "ı")
i
str_subset(i, coll("i", ignore_case = TRUE))
str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))
```

Task-10: Fetching locale information.
```{r}
stringi::stri_locale_info()

```
Task-11:Visualizing word boundaries and extracts all words from the string.
```{r}
x <- "This is a sentence."
str_view_all(x, boundary("word"))
str_extract_all(x, boundary("word"))
```
# CH-15: Factors
## Creatig factors
Task-1:Adding character vector in variable x1
```{r}
x1 <- c("Dec", "Apr", "Jan", "Mar")
```

Task-2:Adding character vector in variable x2
```{r}
x2 <- c("Dec", "Apr", "Jam", "Mar")
```

Task-3:Sorting X1 
```{r}
sort(x1)

```
Task-4:Adding Character vector in month_levels
```{r}
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
```

Task-5:Assigning the factor levels to the variable x1, using the predefined month_levels.
```{r}
y1 <- factor(x1, levels = month_levels)
y
```
Task-6:Sorting the factor levels in y1.
```{r}
sort(y1)
```
Task-7:creating a factor y2 from x2 with custom levels specified by month_levels.
```{r}
y2 <- factor(x2, levels = month_levels)
y2
```
Task-8:parsing the values in x2 as factors
```{r}
y2 <- parse_factor(x2, levels = month_levels)
```
Task-9: omitting the levels.
```{r}
factor(x1)
```
Task-10:Creating a factor f1 from the values in x1, using the unique values of x1 as levels.
```{r}
f1 <- factor(x1, levels = unique(x1))
f1
```
Task-11: creating a factor f2 from the values in x1, ordering them according to their appearance in x1.
```{r}
f2 <- x1 %>% factor() %>% fct_inorder()
f2
```
Task-12:Omitting levels2
```{r}
levels(f2)
```
# General Social Survey
Task-1:Loading datasets
```{r}
gss_cat
```

Task-2:Seeing levels through count()
```{r}
gss_cat %>%
  count(race)
```
Task-3:Also seeing through bar()
```{r}
ggplot(gss_cat, aes(race)) +
  geom_bar()
```

Task-4:Generating a bar plot using ggplot()
```{r}
ggplot(gss_cat,aes(race))+geom_bar()+scale_x_discrete(drop=FALSE)
```
# Modifying factor order
Task-1:calculating summary statistics and then creating scatter plot 
```{r}
relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(relig_summary, aes(tvhours, relig)) + geom_point()
```

Task-2:Generating a scatter plot using `ggplot`, where the x-axis represents the mean TV hours (`tvhours`), and the y-axis represents the `relig` variable reordered by mean TV hours.
```{r}
ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()
```
Task-3:Creating a scatter plot using ggplot.
```{r}
relig_summary %>%
  mutate(relig = fct_reorder(relig, tvhours)) %>%
  ggplot(aes(tvhours, relig)) +
    geom_point()
```
Task-4:Generating a scatter plot using ggplot
```{r}
rincome_summary <- gss_cat %>%
  group_by(rincome) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + geom_point()
```
Task-5: creates a scatter plot of the average age by income level, with "Not applicable" as the reference level for income
```{r}
ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
  geom_point()
```
Task-6:calculating the proportion of each marital status group across different age groups and creates a line plot showing the distribution of marital status proportions by age.
```{r}
by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))

ggplot(by_age, aes(age, prop, colour = marital)) +
  geom_line(na.rm = TRUE)

ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(colour = "marital")
```
Task-7: Adjusting the order of the "marital" variable based on frequency and then reverses the order before generating a bar plot illustrating the distribution of marital status.
```{r}
gss_cat %>%
  mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
  ggplot(aes(marital)) +
    geom_bar()
```
# Modifying factor levels

Task-1: counting the frequency of each unique value in the "partyid" variable of the "gss_cat" dataset.
```{r}
gss_cat%>%count(partyid)
```
Task-2:Recording the levels of the "partyid" variable in the "gss_cat" dataset and then counts the frequency of each unique recorded value.
```{r}
gss_cat %>%
  mutate( partyid=fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
    ))%>%
  count(partyid)
```
Task-3:Recategorizing and counting party affiliations in the "gss_cat" dataset.
```{r}
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat",
    "Other"                 = "No answer",
    "Other"                 = "Don't know",
    "Other"                 = "Other party"
  )) %>%
  count(partyid)
```
Task-4: Collapsing categories within the "partyid" variable in the "gss_cat" dataset into broader groups and then counting the frequency of each collapsed category.
```{r}
gss_cat%>%
  mutate(partyid=fct_collapse(partyid,
                              other=c("No answer", "Don't know", "Other party"),
                              rep=c("Strong republican", "Not str republican"),
                              ind=c("Ind,near rep", "Independent", "Ind,near dem"),
                              dem=c("Not str democrat", "Strong democrat"))) %>%
  count(partyid)
```

Task-5:Counting and aggregating religious affiliations in the "gss_cat" dataset after lumping together less frequent categories.
```{r}
gss_cat %>%
  mutate(relig = fct_lump(relig)) %>%
  count(relig)
```
Task-6:"Summarizing religious affiliations after lumping infrequent categories and sort."
```{r}
gss_cat %>%
  mutate(relig = fct_lump(relig, n = 10)) %>%
  count(relig, sort = TRUE) %>%
  print(n = Inf)
```

# CH-Data and Times

Task-1:Loading library
```{r}
library(tidyverse)

library(lubridate)
library(nycflights13)
```

## Creating dates/times
Task-1: Printing  current date or date-time
```{r}
today()
now()
```
## Form strings
Task-2:converting date strings to date objects in different formats.
```{r}
ymd("2017-01-31")
mdy("January 31st, 2017")
dmy("31-Jan-2017")
```
```{r}
ymd(20170131)
```

```{r}
ymd_hms("2017-01-31 20:11:59")
mdy_hm("01/31/2017 08:01")
```


```{r}
flights %>% 
  select(year, month, day, hour, minute)
flights %>% 
  select(year, month, day, hour, minute) %>% 
  mutate(departure = make_datetime(year, month, day, hour, minute))
```
Task: Creating date-time objects from hour-minute time data in the 'flights' dataset and filtering out rows with missing departure or arrival times
```{r}
make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}

flights_dt <- flights %>% 
  filter(!is.na(dep_time), !is.na(arr_time)) %>% 
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) %>% 
  select(origin, dest, ends_with("delay"), ends_with("time"))

flights_dt
```
Task: Plotting the frequency of flights over time using departure date-time

```{r}
flights_dt %>% 
  ggplot(aes(dep_time)) + 
  geom_freqpoly(binwidth = 86400) 
```
Task: Plotting the frequency of flights over time for a specific period using departure date-time

```{r}
flights_dt %>% 
  filter(dep_time < ymd(20130102)) %>% 
  ggplot(aes(dep_time)) + 
  geom_freqpoly(binwidth = 600) # 600 s = 10 minutes
```
Task: to convert today's date to date-time object
```{r}
as_datetime(today())

as_date(now())

as_date(365 * 10 + 2)

```
Date-time components
Task: Extracting various components of a date-time object
```{r}
datetime <- ymd_hms("2016-07-08 12:34:56")
year(datetime)
month(datetime)
mday(datetime)
yday(datetime)
wday(datetime)
month(datetime, label = TRUE)
wday(datetime, label = TRUE, abbr = FALSE)
```
Task: Plotting the frequency of flights by day of the week
```{r}
flights_dt %>% 
  mutate(wday = wday(dep_time, label = TRUE)) %>% 
  ggplot(aes(x = wday)) +
    geom_bar()
```
Task: Plotting average delay by minute of departure time
```{r}
flights_dt %>% 
  mutate(minute = minute(dep_time)) %>% 
  group_by(minute) %>% 
  summarise(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    n = n()) %>% 
  ggplot(aes(minute, avg_delay)) +
    geom_line()
```
Task: Plotting average delay by minute of scheduled departure time

```{r}
sched_dep <- flights_dt %>% 
  mutate(minute = minute(sched_dep_time)) %>% 
  group_by(minute) %>% 
  summarise(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    n = n())

ggplot(sched_dep, aes(minute, avg_delay)) +
  geom_line()
```
Task: Plotting the number of flights by minute of scheduled departure time
```{r}
ggplot(sched_dep, aes(minute, n)) +
  geom_line()
```
Rounding
Task:Plotting the number of flights by week, rounding to the nearest week

```{r}
flights_dt %>% 
  count(week = floor_date(dep_time, "week")) %>% 
  ggplot(aes(week, n)) +
    geom_line()

```
setting compounds
Task: Setting up a date-time object
```{r}
(datetime <- ymd_hms("2016-07-08 12:34:56"))
year(datetime) <- 2020
datetime
month(datetime) <- 01
datetime
hour(datetime) <- hour(datetime) + 1
datetime
update(datetime, year = 2020, month = 2, mday = 2, hour = 2)
```
```{r}
ymd("2015-02-01") %>% 
  update(mday = 30)
ymd("2015-02-01") %>% 
  update(hour = 400)
```
Task: Creating a new variable 'dep_hour' by updating the 'dep_time' to the first day of the year

```{r}
flights_dt %>% 
  mutate(dep_hour = update(dep_time, yday = 1)) %>% 
  ggplot(aes(dep_hour)) +
    geom_freqpoly(binwidth = 300)
```
Time Spans
Compute the age of a person based on their birthdate and today's date
```{r}
h_age <- today() - ymd(19791014)
h_age
as.duration(h_age)
```
```{r}
dseconds(15)
dminutes(10)
dhours(c(12, 24))
ddays(0:5)
dweeks(3)
dyears(1)
```
```{r}
2 * dyears(1)
dyears(1) + dweeks(12) + dhours(15)
tomorrow <- today() + ddays(1)
last_year <- today() - dyears(1)
one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")
one_pm
one_pm + ddays(1)
```
Periods
Create period objects representing different time spans and Perform arithmetic operations with period objects
```{r}
one_pm
one_om = days(1)
```
```{r}
seconds(15)
minutes(10)
hours(c(12, 24))
days(7)
months(1:6)
weeks(3)
years(1)
```

```{r}
10 * (months(6) + days(1))
days(50) + hours(25) + minutes(2)
```

```{r}
ymd("2016-01-01") + dyears(1)
ymd("2016-01-01") + years(1)
one_pm + ddays(1)
one_pm + days(1)
```

Filter flights where arrival time is before departure time
```{r}
flights_dt %>% 
  filter(arr_time < dep_time) 
```

Update flights data to correct overnight flights
```{r}
flights_dt <- flights_dt %>% 
  mutate(
    overnight = arr_time < dep_time,
    arr_time = arr_time + days(overnight * 1),
    sched_arr_time = sched_arr_time + days(overnight * 1)
  )
```

Filter flights where overnight condition is true and arrival time is before departure time
```{r}
flights_dt %>% 
  filter(overnight, arr_time < dep_time) 
```

Intervals
Calculate the ratio of one year in days
```{r}
years(1) / days(1)
next_year <- today() + years(1)
(today() %--% next_year) / ddays(1)
(today() %--% next_year) %/% days(1)
```

Display time zone information
```{r}
Sys.timezone()
length(OlsonNames())
head(OlsonNames())
```
```{r}
(x1 <- ymd_hms("2015-06-01 12:00:00", tz = "America/New_York"))
(x2 <- ymd_hms("2015-06-01 18:00:00", tz = "Europe/Copenhagen"))
(x3 <- ymd_hms("2015-06-02 04:00:00", tz = "Pacific/Auckland"))
```
```{r}
x1 - x2
x1 - x3
```
# Pipes

Task: To import the required library
```{r}
packages_to_install <- c("tidyverse", "pryr")
for (package_name in packages_to_install) {
  if (!requireNamespace(package_name, quietly = TRUE)) {
    install.packages(package_name)
  }
  library(package_name, character.only = TRUE)
}

library(magrittr)
```

Create diamond data and calculate the object sizes
```{r}
diamonds <- ggplot2::diamonds
diamonds2 <- diamonds %>% 
  dplyr::mutate(price_per_carat = price / carat)

pryr::object_size(diamonds)
pryr::object_size(diamonds2)
pryr::object_size(diamonds, diamonds2)
```
Functions
Normalize the columns of a data frame
```{r}
df <- tibble::tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

df$a <- (df$a - min(df$a, na.rm = TRUE)) / 
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) / 
  (max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) / 
  (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) / 
  (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
```
Normalize a single column of a data frame
```{r}
(df$a - min(df$a, na.rm = TRUE)) /
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
```
```{r}
x <- df$a
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
```
```{r}
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
```
```{r}
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(c(0, 5, 10))
```

Rescale a vector to the range [0, 1]
```{r}
rescale01(c(-10, 0, 10))
rescale01(c(1, 2, 3, NA, 5))
```

Rescale each column of a DataFrame to the range [0, 1]
```{r}
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
```
```{r}
x <- c(1:10, Inf)
rescale01(x)
```

Define the rescale01 function and apply it
```{r}
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(x)
```

Load required libraries and packages
```{r}
library(tidyverse)
library(purrr)
library(magrittr)

# install.packages("pryr")
library(pryr)
```

## 18.2 Piping alternatives
This is a popular Children’s poem that is accompanied by hand actions.We’ll start by defining an object to represent little bunny Foo Foo:

```{r 18.2-1}
# foo_foo <- little_bunny()
```
### 18.2.1 Intermediate steps
The simplest approach is to save each step as a new object:

```{r 18.2.1-1}
# foo_foo_1 <- hop(foo_foo,through=forest)
# foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
# foo_foo_3 <- bop(foo_foo_2, on = head)
```

Create diamonds dataset and calculate price per carat
```{r 18.2.1-2}
diamonds <- ggplot2::diamonds
diamonds2 <- diamonds %>% 
  dplyr::mutate(price_per_carat=price/carat)

pryr::object_size(diamonds)
pryr::object_size(diamonds2)
pryr::object_size(diamonds,diamonds2)
```

Introduce NA value into diamonds$carat and check object sizes
```{r 18.2.1-3}
diamonds$carat[1] <- NA
pryr::object_size(diamonds)
pryr::object_size(diamonds2)
pryr::object_size(diamonds,diamonds2)
```

### 18.2.2 Overwrite the original
Instead of creating intermediate objects at each step, we could overwrite the original object:
```{r 18.2.2-1}
# foo_foo <- hop(foo_foo, through = forest)
# foo_foo <- scoop(foo_foo, up = field_mice)
# foo_foo <- bop(foo_foo, on = head)
```
### 18.2.3 Function composition
Another approach is to abandon assignment and just string the function calls together:
```{r 18.2.3-1}
# bop(
#   scoop(
#     hop(foo_foo, through = forest),
#     up = field_mice
#   ), 
#   on = head
# )
```

Here the disadvantage is that you have to read from inside-out, from right-to-left, and that the arguments end up spread far apart (evocatively called the dagwood sandwhich problem). In short, this code is hard for a human to consume.

### 18.2.4 Use the pipe
Finally, we can use the pipe:
```{r 18.2.4-1}
# foo_foo %>%
#   hop(through = forest) %>%
#   scoop(up = field_mice) %>%
#   bop(on = head)
```

```{r 18.2.4-2}
# my_pipe <- function(.) {
#   . <- hop(., through = forest)
#   . <- scoop(., up = field_mice)
#   bop(., on = head)
# }
# my_pipe(foo_foo)
```
TASK:  Functions that use the current environment. For example, `assign()` will create a new variable with the given name in the current environment:
```{r 18.2.4-3}
assign("x",10)
x

"x" %>% assign(100)
x
```
Assign value to "x" in the specified environment and check its value and Generate random numbers, create a matrix, plot it, and inspect its structure
```{r 18.2.4-4}
env <- environment()
"x" %>% assign(100,envir=env)
x
```
```{r 18.4-1}
rnorm(100) %>% 
  matrix(ncol=2) %>% 
  plot() %>% 
  str()

rnorm(100) %>% 
  matrix(ncol=2) %>% 
  plot() %>% 
  str()

ndist <- rnorm(100000)
hist(ndist)
```

Calculate the correlation between two variables in mtcars dataset
```{r 18.4-2}
mtcars %$%
  cor(disp, mpg)
```

- For assignment magrittr provides the `%<>%` operator which allows you to replace code like:
```{r 18.4-3}
mtcars <- mtcars %>% 
  transform(cyl=cyl*2)
```


```{r 18.4-4}
mtcars %<>% transform(cyl=cyl*2)
```
# Chapter 19 Functions
## 19.1 Introduction

## 19.2 When should you write a function?
```{r 19.2-1}
df <- tibble::tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
df

df$a <- (df$a - min(df$a, na.rm = TRUE)) / 
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) / 
  (max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) / 
  (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) / 
  (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
```

Rescale a single variable in a data frame
```{r 19.2-2}
(df$a - min(df$a, na.rm = TRUE)) /
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
```

Rescale a single variable without creating a new object
```{r}
x <- df$a
(x - min(x, na.rm = T)) / (max(x, na.rm = T)-min(x, na.rm = T))
```

Task: There is some duplication in this code. We’re computing the range of the data three times, so it makes sense to do it in one step:
```{r}
rng <- range(x, na.rm = T)
(x-rng[1])/(rng[2]-rng[1])
```

Pulling out intermediate calculations into named variables is a good practice because it makes it more clear what the code is doing. Now that I’ve simplified the code, and checked that it still works, I can turn it into a function:
```{r 19.2-5}
rescale01 <- function(x){
  rng <- range(x, na.rm = T)
  (x-rng[1])/(rng[2]-rng[1])
}
rescale01(c(0,5,10))
```

Test the rescale01 function with various inputs
```{r 19.2-6}
rescale01(c(-10,0,10))
rescale01(c(1,2,3,NA,5))
```

We can simplify the original example now that we have a function:
```{r 19.2-7}
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
```

Rescale a vector with infinite values
```{r 19.2-8}
x <- c(1:10,Inf)
rescale01(x)
```

Because we’ve extracted the code into a function, we only need to make the fix in one place:
```{r 19.2-9}
rescale01 <- function(x){
  rng <- range(x,na.rm=T,finite=T)
  (x-rng[1])/(rng[2]-rng[1])
}
rescale01(x)
```

## 19.4 Conditional execution
An `if` statement allows you to conditionally execute code. 
It looks like this:
```{r}
# if (condition) {
  # code executed when condition is TRUE
# } else {
  # code executed when condition is FALSE
# }
```

Define a function to check if an object has names
```{r}
has_name <- function(x){
  nms <- names(x)
  if(is.null(nms)){
    rep(FALSE,length(x))
  }else {
    !is.na(nms) & nms !=""
  }
}
```
### 19.4.1 Conditions
how if condition works with warnings
```{r 19.4.1-1}
# if (c(TRUE,FALSE)){}
#> Warning in if (c(TRUE, FALSE)) {: the condition has length > 1 and only the
#> first element will be used
#> NULL

# if (NA) {}
```
Check if two objects are identical
```{r}
identical(0L,0)
x <- sqrt(2)^2
x==2
x-2
```

### 19.4.2 Multiple conditions
You can chain multiple if statement together:
```{r 19.4.2-1}
# if (this) {
#   # do that
# } else if (that) {
#   # do something else
# } else {
#   # 
# }
```

```{r 19.4.2-2}
#> function(x, y, op) {
#>   switch(op,
#>     plus = x + y,
#>     minus = x - y,
#>     times = x * y,
#>     divide = x / y,
#>     stop("Unknown op!")
#>   )
#> }
```
### 19.4.3 Code style

Good practice for writing if statements
```{r 19.4.3-1}
# Good
# if (y < 0 && debug) {
#   message("Y is negative")
# }
# 
# if (y == 0) {
#   log(x)
# } else {
#   y ^ x
# }
# 
# # Bad
# if (y < 0 && debug)
# message("Y is negative")
# 
# if (y == 0) {
#   log(x)
# } 
# else {
#   y ^ x
# }
```

It’s ok to drop the curly braces if you have a very short if statement that can fit on one line: 
```{r 19.4.3-2}
y <- 10
x <- if (y < 20) "Too low" else "Too high"
```

I recommend this only for very brief `if` statements. Otherwise, the full form is easier to read:
```{r}
if (y < 20) {
  x <- "Too low" 
} else {
  x <- "Too high"
}
```

## 19.5 Function arguments 

```{r}
# Compute confidence interval around mean using normal approximation
mean_ci <- function(x, conf = 0.95) {
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - conf
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}

x <- runif(100)
mean_ci(x)

mean_ci(x, conf = 0.99)

```


### 19.5.1 Choosing names
### 19.5.2 Cheking values
```{r}
wt_mean <- function(x, w) {
  sum(x * w) / sum(w)
}
wt_var <- function(x, w) {
  mu <- wt_mean(x, w)
  sum(w * (x - mu) ^ 2) / sum(w)
}
wt_sd <- function(x, w) {
  sqrt(wt_var(x, w))
}
```

What happens if x and w are not the same length?
```{r}
wt_mean(1:6, 1:3)

```

In this case, because of R’s vector recycling rules, we don’t get an error.

It’s good practice to check important preconditions, and throw an error (with `stop()`), if they are not true:
```{r}
wt_mean <- function(x, w) {
  if (length(x) != length(w)) {
    stop("`x` and `w` must be the same length", call. = FALSE)
  }
  sum(w * x) / sum(w)
}
```

```{r}
wt_mean <- function(x, w, na.rm = FALSE) {
  if (!is.logical(na.rm)) {
    stop("`na.rm` must be logical")
  }
  if (length(na.rm) != 1) {
    stop("`na.rm` must be length 1")
  }
  if (length(x) != length(w)) {
    stop("`x` and `w` must be the same length", call. = FALSE)
  }
  
  if (na.rm) {
    miss <- is.na(x) | is.na(w)
    x <- x[!miss]
    w <- w[!miss]
  }
  sum(w * x) / sum(w)
}
```

This is a lot of extra work for little additional gain. A useful compromise is the built-in `stopifnot()`: it checks that each argument is `TRUE`, and produces a generic error message if not.
```{r}
wt_mean <- function(x, w, na.rm = FALSE) {
  stopifnot(is.logical(na.rm), length(na.rm) == 1)
  stopifnot(length(x) == length(w))
  
  if (na.rm) {
    miss <- is.na(x) | is.na(w)
    x <- x[!miss]
    w <- w[!miss]
  }
  sum(w * x) / sum(w)
}
```
### 19.5.3 Dot-dot-dot(...)
Many functions in R take an arbitrary number of inputs: 
```{r 19.5.3-1}
sum(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
stringr::str_c("a", "b", "c", "d", "e", "f")
```

Define a function to concatenate strings with commas
```{r 19.5.3-2}
commas <- function(...) stringr::str_c(..., collapse = ", ")
commas(letters[1:10])


rule <- function(..., pad = "-") {
  title <- paste0(...)
  width <- getOption("width") - nchar(title) - 5
  cat(title, " ", stringr::str_dup(pad, width), "\n", sep = "")
}
rule("Important output")
```
```{r}
x <- c(1,2)
sum(x,na.rm=T)
```

Define a function 'complicated_function' with conditions to return 0 if 'x' or 'y' is empty
```{r}
complicated_function <- function(x,y,z){
  if (lenth(x)==0 || length(y)==0){
    return(0)
  }
}
```

Improve readability of if-else blocks by using early return for simple cases
```{r}
f <- function() {
  if (x) {
    # Do 
    # something
    # that
    # takes
    # many
    # lines
    # to
    # express
  } else {
    # return something short
  }
}
```

But if the first block is very long, by the time you get to the else, you’ve forgotten the condition. One way to rewrite it is to use an early return for the simple case:
```{r 19.6.1-3}
f <- function() {
  if (!x) {
    return(something_short)
  }

  # Do 
  # something
  # that
  # takes
  # many
  # lines
  # to
  # express
}
```

This tends to make the code easier to understand, because you don't need quite so much context to understand it.

### 19.6.2 Writing pipeable functions
Define a function to show the count of missing values in a data frame
```{r 19.6.2-1}
show_missing <- function(df){
  n <- sum(is.na(df))
  cat("Missing values:",n,"\n",sep="")
  
  invisible(df)
}
```

If we call it interatively, the `invisible()` means that the input `df` does not get printed out: 
```{r}
show_missing(mtcars)
```

But it's still there, it's not just printed by default:
```{r}
x <- show_missing(mtcars)
class(x)
dim(x)
```

And we can still use it in a pipe:
```{r 19.6.2-4}
library(magrittr)
library(tidyverse)

mtcars %>% 
  show_missing() %>% 
  mutate(mpg=ifelse(mpg<20,NA,mpg)) %>% 
  show_missing()
```

## 19.7 Environment
Define a function 'f' that takes an argument 'x' and returns the sum of 'x' and 'y'
```{r 19.7-1}
f <- function(x){
  x+y
}
```
Demonstrate how changing the value of 'y' affects the result of calling function 'f'
```{r 19.7-2}
y <- 100
f(10)
y <- 1000
f(10)
```
Overload the '+' operator to behave differently based on a random condition
```{r 19.7-3}
`+` <- function(x, y) {
  if (runif(1) < 0.1) {
    sum(x, y)
  } else {
    sum(x, y) * 1.1
  }
}
table(replicate(1000, 1 + 2))
#> 
#>   3 3.3 
#> 100 900
rm(`+`)
```
# Chapter 20: Vectors 

### 20.1.1 PRerequisites 

```{r}
library(tidyverse)
```

## 20.2 Vector basics 
Determine the data type of different vectors
```{r}
typeof(letters)
typeof(1:10)
```

Determine the length of a list and display its contents
```{r}
x <- list("a","b",1:10)
length(x)
x
```
Demonstrate modulo operation and creation of logical vectors
```{r}
1:10 %% 3 ==0
c(T,T,F,NA)
```

### 20.3.2 Numeric
Integer and double vectors are known collectively as numeric vectors. In R, numbers are doubles by default. To make an integer, place an L after the number:
```{r 20.3.2-1}
typeof(1)
typeof(1L)
1.5
```
Demonstrate the behavior of floating point arithmetic
```{r 20.3.2-2}
x <- sqrt(2)^2
x
x-2

```
Demonstrate the behavior of division by zero
```{r 20.3.2-3}
c(-1,0,1)%/% 0
# [1] -Inf  NaN  Inf
```



### 20.3.3 Character
Determine the memory size of a string and a replicated string vector
```{r 20.3.3}
x <- "This is a reasonably long string."
pryr::object_size(x)

y <- rep(x,1000)
pryr::object_size(y)
```


### 20.3.4 Missing values
Note that each type of atomic vector has its own missing value:
```{r}
NA            # logical

NA_integer_   # integer

NA_real_      # double

NA_character_ # character

```
Calculate the number and proportion of elements in a vector greater than 10
```{r 20.4.1-1}
x <- sample(20,100,replace=T)
y <- x > 10
sum(y) # how many are greater than 10?
mean(y) # what proportion are greater than 10?
```
```{r 20.4.1-2}
if (length(x)){
}
```
Determine the data type of different vectors
```{r 20.4.1-3}
typeof(c(TRUE,1L))
typeof(c(1L,1.5))
typeof(c(1.5,"a"))
```
Generate random numeric or logical vectors
```{r}
sample(10)+100
runif(10)>0.5
```
Demonstrate vector arithmetic with vectors of different lengths
```{r}
1:10 +1:2
```

```{r}
1:10+1:3
```
Create a tibble with two columns, 'x' and 'y', with different lengths
```{r}
library(tidyverse)



tibble(
  x=1:4,
  y=rep(1:2,each=2)
)
```

#### 20.4.4 Naming vectors
All types of vectors can be named. You can name them during creatin with `c()`:
```{r}
c(x=1,y=2,z=4)
```

Or after the fact with `purr::set_names()`
```{r}
set_names(1:3,c("a","b","c"))
```

Named vectors are most useful for subsetting, described next.

### 20.4.5 Subsetting
Demonstrate subsetting vectors with integer indices
```{r}
x <- c("one","two","three","four","five")
x[c(3,2,5)]
```

By repeating a position, you can actually make a longer output than input:
```{r}
x[c(1,1,5,5,5,2)]
```

Negative values drop the elements at the specified positions:
```{r}
x[c(-1,-3,-5)]
```
The error message mentions subsetting with zero, which returns no values:
```{r}
x[0]
```
```{r}
library(tidyverse)
x <- c(10,3,NA,5,8,1)

# tibble test
x <- as.tibble(x,ncol=1)
names(x)="v1"
is.na(x)
x %>% filter(v1 == NA)

# all non-missing values of x
x <- c(10,3,NA,5,8,1)
x[!is.na(x)]

# all even (or missing) values of x
x[x %% 2==0]
```

3. If you have a named vector, you can subset it with a character vector:
```{r}
x <- c(abc=1, def=2,xyz=5)
x[c("xyz","def")]
```

## 20.5 Recursive vectors (lists)
Create a list with numeric elements
```{r 20.5-1}
x <- list(1,2,3)
x
```
Display the structure of lists with and without names
```{r 20.5-2}
str(x)
x_named <- list(a=1,b=2,c=3)
str(x_named)
```

Unlike atomic vectors, `list()` can contain a mix of objects:
```{r 20.5-3}
y <- list("a",1L,1.5,T)
str(y)
```

List can even contain other lists!
```{r}
z <- list(list(1,2),list(3,4))
str(z)
```

### 20.5.1 Visualizing lists 
```{r 20.5.1}
x1 <- list(c(1,2),c(3,4))
x2 <- list(list(1,2),list(3,4))
x3 <- list(1,list(2,list(3)))
x1
x2
x3
```
### 20.5.2 Subsetting
Create a list 'a' with named elements and demonstrate subsetting
```{r}
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
```

```{r 20.5.2-2}
str(a)
str(a[1:2])
str(a[4])
```
Demonstrate subsetting lists using double square brackets
```{r 20.5.2-3}
str(a[[1]])
str(a[[4]])
```
Access list elements by name using $ or [[ ]]
```{r}
a$a
a[["a"]]
```
## 20.6 Attributes
Demonstrate setting and retrieving attributes of vectors
```{r 20.6-1}
x <- 1:10
attr(x,"greeting")

attr(x,"greeting") <- "Hi!"
attr(x,"farewell") <- "Bye!"
attributes(x)
```
Demonstrate methods for class 'Date'
```{r}
as.Date
```
```{r 20.6-3}
methods("as.Date")
```
Retrieve specific methods for 'as.Date'
```{r}
getS3method("as.Date","default")
getS3method("as.Date","numeric")
```
### 20.7.1 Factors 
Demonstrate creating a factor and inspecting its attributes
```{r}
x <- factor(c("ab","cd","ab"),levels=c("ab","cd","ed"))
typeof(x)
attributes(x)
```

### 20.7.2 Dates and date-times
Dates in R are numeric vectors that represent the number of days since 1 January 1970.
```{r 20.7.2-1}
x <- as.Date("1971-01-01")
unclass(x)

typeof(x)
attributes(x)
```
Demonstrate creating and inspecting a date-time object
```{r 20.7.2-2}
x <- lubridate::ymd_hm("1970-01-01 01:00")
unclass(x)

typeof(x)
attributes(x)
```
Demonstrate setting and retrieving time zone for date-time object
```{r 20.7.2-3}
attr(x,"tzone") <- "US/Pacific"
x

attr(x,"tzone") <- "US/Eastern"
x
```

There is another type of date-times called POSIXIt. There are built on top of named lists:
```{r}
y <- as.POSIXlt(x)
typeof(y)
#> [1] "list"
attributes(y)
```

### 20.7.3 Tibbles
Tibbles are augmented lists: they have class “tbl_df” + “tbl” + “data.frame”, and `names` (column) and `row.names` attributes:
```{r 20.7.3-1}
tb <- tibble::tibble(x = 1:5, y = 5:1)
typeof(tb)
attributes(tb)

```
```{r 20.7.3-2}
df <- data.frame(x = 1:5, y = 5:1)
typeof(df)
attributes(df)
```
# Chapter 21: Iteration
### 21.1.1 Prerequisites
```{r}
library(tidyverse)
```

## 21.2 For loops
Imagine we have this simple tibble:
```{r}
df <- tibble(
  a=rnorm(10),
  b=rnorm(10),
  c=rnorm(10),
  d=rnorm(10)
)
```
Calculate the median for each column in a tibble 
```{r}
median(df$a)
median(df$b)
median(df$c)
median(df$d)
```
Calculate the median for each column in the data frame 'df' using a for loop
```{r}
df
output <- vector("double",ncol(df))
for (i in seq_along(df)){
  output[[i]] <- median(df[[i]])
}
output <- tibble(output)
```
Demonstrate the behavior of seq_along and length functions with an empty vector 'y'
```{r}
y <- vector("double", 0)
seq_along(y)
#> integer(0)
1:length(y)
#> [1] 1 0
```
### 21.3.1v Modifying an existing object
Sometimes, you want to use a for loop to modify an existing object. For example, remember our challenges from functions. We wanted to rescale every column in a data frame:
```{r}
library(tidyverse)

df <- tibble(
  a=rnorm(10),
  b=rnorm(10),
  c=rnorm(10),
  d=rnorm(10)
)

rescale01 <- function(x){
  rng <- range(x,na.rm=T)
  (x-rng[1])/(rng[2]-rng[1])
}

df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)

df
```

```{r}
for ( i in seq_along(df)){
  df[[i]] <- rescale01(df[[i]])
}
```

### 21.3.2 Looping patterns
```{r}
x
results <- vector("list",length(x))
names(results) <- names(x)
```

Demonstrate looping patterns using a for loop to iterate over a list 'x' and store results in a list 'results'
```{r}
for(i in seq_along(x)){
  name <- names(x)[[i]]
  value <- x[[i]]
}
```

### 21.3.3 Unknown output length
Create a vector 'output' with unknown length and store results from a for loop in it
```{r}
means <- c(0,1,2)

output <- double()
for (i in seq_along(means)){
  n <- sample(100,1)
  output <- c(output,rnorm(n,means[[i]]))
}
str(output)
output
```

Create a list 'out' with unknown length and store results from a for loop in it
```{r}
out <- vector("list",length(means))
for (i in seq_along(means)){
  n <- sample(100,1)
  out[[i]] <- rnorm(n,means[[i]])
}
str(out)
str(unlist(out))
```
### 21.3.4 Unknown sequence length


A while loop is also more general than a for loop, because you can rewrite any for loop as a while loop, but you can't rewrite every while loop as for loop:
```{r}
for (i in seq_along(x)) {
  # body
}

# Equivalent to
i <- 1
while (i <= length(x)) {
  # body
  i <- i + 1 
}

```

Herhow we could use a while loop to find how many tries it takes to get three heads in a row: 
```{r}
flip <- function() sample(c("T", "H"), 1)

flips <- 0
nheads <- 0

while (nheads < 3) {
  if (flip() == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0
  }
  flips <- flips + 1
}
flips
```
## 21.4 For loops vs. functionals
Compare for loop and functional approaches for calculating column means in a data frame
```{r}
df <- tibble(
  a=rnorm(10),
  b=rnorm(10),
  c=rnorm(10),
  d=rnorm(10)
)
```
Using for loop
```{r}
output <- vector("double",length(df))
for (i in seq_along(df)){
  output[[i]] <- mean(df[[i]])
}
output
```
Using functional approach with a custom function 'col_mean'
```{r}
col_mean <- function(df){
  output <- vector("double",length(df))
  for (i in seq_along(df)){
    output[i] <- mean(df[[i]])
  }
  output
}
```
Define a function 'col_median' to calculate the median for each column in the data frame 'df'
```{r}
col_median <- function(df){
  output <- vector("double",hh(df))
  for (i in seq_along(df)){
    output[i] <- median(df[[i]])
  }
  output
}

col_sd <- function(df){
  output <- vector("double",length(df))
  for (i in seq_along(df)){
    output[i] <- sd(df[[i]])
  }
  output
}

df
```
Define functions f1, f2, and f3 for calculating different powers of absolute deviation from the mean
```{r}
f1 <- function(x) abs(x-mean(x))^1
f2 <- function(x) abs(x-mean(x))^2
f3 <- function(x) abs(x-mean(x))^3
```
Define a function 'f' to calculate the absolute deviation from the mean raised to a given power 'i'
```{r}
f <- function(x,i) abs(x-mean(x))^i
```
Define a function 'col_summary' to apply a summary function 'fun' to each column of the data frame 'df'
```{r}
col_summary <- function(df, fun) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}
col_summary(df, median)
col_summary(df, mean)
```
Demonstrate the use of 'map_dbl' from the 'purrr' package to apply a function to each column of the data frame 'df'
```{r}
library(purrr)
head(df)


# Reference - for loop()
output <- vector("double",length(df))
for (i in seq_along(df)){
  output[[i]] <- mean(df[[i]])
}
output

map_dbl(df,mean)
map_dbl(df,median)
map_dbl(df,sd)
```

```{r}
df %>% map_dbl(mean)
df %>% map_dbl(median)
df %>% map_dbl(sd)
```
Demonstrate the use of 'map_dbl' from the 'purrr' package with additional arguments
```{r}
map_dbl(df,mean,trim=0.5)
```
Demonstrate the use of 'map_int' from the 'purrr' package to apply a function that returns integers to each element of a list
```{r}
z <- list(x=1:3,y=4:5)
z

map_int(z,length)
```

### 21.5.1 Shortcuts 
Demonstrate the use of 'safely' from the 'purrr' package to create a safe version of a function

```{r}
safe_log <- safely(log)
str(safe_log(10))
str(safe_log("a"))
```
Demonstrate the use of 'map' from the 'purrr' package with 'safely' to apply a safe version of a function to each element of a list

```{r}
x <- list(1,10,"a")
y <- x %>% map(safely(log))
str(y)

```
Demonstrate the use of 'transpose' from the 'purrr' package to transpose a list of lists
```{r}
y <- x %>% transpose()
str(y)
```
Demonstrate the use of error handling with 'map_lgl' and 'is_null' from the 'purrr' package
```{r}
is_ok <- y$error %>% map_lgl(is_null)
x[!is_ok]
# y$result[is_ok] %>% flatten_dbl()
```

Purrr provides two other useful adverbs:
```{r}
x <- list(1,10,"a")
x %>% map_dbl(possibly(log,NA_real_))
```
Demonstrate the use of 'quietly' from the 'purrr' package to suppress errors and return results with warnings
```{r}
x <- list(1,-1)
x %>% map(quietly(log)) %>% str()
```


## 21.7 Mapping over multiple arguments 
Generate random numbers from normal distributions with different means using 'map' from the 'purrr' package
```{r}
mu <- list(5,10,-3)
mu %>% 
  map(rnorm,n=5) %>% 
  str()
```
Generate random numbers from normal distributions with different means and standard deviations using 'map2' from the 'purrr' package
```{r}
sigma <- list(1,5,10)
seq_along(mu) %>% 
  map(~rnorm(5,mu[[.]],sigma[[.]])) %>% 
  str()
```
Define a custom 'map2' function to apply a binary function to corresponding elements of two lists
```{r}
map2(mu,sigma,rnorm,n=5) %>% str()
```

```{r}
map2 <- function(x,y,f,...){
  out <- vector("list",length(x))
  for (i in seq_along(x)){
    out[[i]] <- f(x[[i]],y[[i]],...)
  }
  out
}
```
Apply a function to corresponding elements of multiple lists using 'pmap' from the 'purrr' package
```{r}

library(magrittr)
library(purrr)

n <- list(1,3,5)
args1 <- list(n,mu,sigma)
args1 %>% 
  pmap(rnorm) %>% 
  str()
```
Apply a function to corresponding elements of multiple lists with named parameters using 'pmap' from the 'purrr' package
```{r}
args2 <- list(mean=mu, sd=sigma,n=n)
args2 %>% 
  pmap(rnorm) %>% 
  str()
```
Apply a function to corresponding rows of a data frame using 'pmap' from the 'purrr' package with a tibble
```{r}
library(tidyverse)
parms <- tribble(
  ~mean,~sd,~n,
  5,1,1,
  10,5,3,
  -3,10,5
)

parms %>% 
  pmap(rnorm)
```
### 21.7.1 Involing different functions
Invoke different functions with different parameters using 'invoke_map' from the 'purrr' package
```{r}
f <- c("runif","rnorm","rpois")
param <- list(
  list(min=-1,max=1),
  list(sd=5),
  list(lambda=10)
)

f
param
```

To handle this case, you can use `invoke_map()`:
```{r}
invoke_map(f,param,n=5) %>% 
  str()
```
Invoke different functions with different parameters using 'pmap' from the 'purrr' package and a tibble
```{r}
sim <- tribble(
  ~f,      ~params,
  "runif", list(min = -1, max = 1),
  "rnorm", list(sd = 5),
  "rpois", list(lambda = 10)
)
sim %>% 
  mutate(sim = invoke_map(f, params, n = 10))
```

## 21.8 Walk
Perform side effects without returning a value for each element of a list using 'walk' from the 'purrr' package
```{r}
x <- list(1,"a",3)
x %>% 
  walk(print)
```
Perform side effects on each element of a list using 'walk' from the 'purrr' package, then save the results
```{r}
library(ggplot2)
plots <- mtcars %>% 
  split(.$cyl) %>% 
  map(~ggplot(., aes(mpg, wt)) + geom_point())
paths <- stringr::str_c(names(plots), ".pdf")

pwalk(list(paths, plots), ggsave, path = tempdir())
```
Retain or remove elements of a list based on a predicate function using 'keep' and 'discard' from the 'purrr' package
```{r}
iris %>% 
  keep(is.factor) %>% 
  str()

iris %>% 
  discard(is.factor) %>%
  str()
```



```{r}
library(tidyverse)
library(magrittr)


```

### 21.9.2 Reduce and accumulate
Iteratively combine elements of a list using a binary function with 'reduce' from the 'purrr' package
```{r}
dfs <- list(
  age=tibble(name="John",age=30),
  sex=tibble(name=c("John","Mary"),sex=c("M","F")),
  trt=tibble(name="Mary",treatment="A")
)

dfs %>% reduce(full_join)
```
Find the intersection of multiple vectors using 'reduce' from the 'purrr' package
```{r}
vs <- list(
  c(1,3,5,6,10),
  c(1,2,3,7,8,10),
  c(1,2,3,4,8,9,10)
)
vs %>% reduce(intersect)
```
Iteratively apply a function to elements of a list using 'accumulate' from the 'purrr' package
```{r}
x <- sample(10)
x
x %>% accumulate(`+`)
```

